MIT Study Shows LLMs Factor in Unrelated Information When Recommending Medical Treatments

Insider Brief

  • A new MIT study, presented at the ACM Conference on Fairness, Accountability, and Transparency, finds that large language models used in health care can make flawed treatment recommendations when exposed to nonclinical text variations such as typos, informal language, and missing gender cues.
  • The study tested four models, including GPT-4, using altered patient messages that preserved clinical content but mimicked realistic communication styles, revealing a 7–9% increase in erroneous self-care advice, particularly for female patients.
  • Researchers call for stricter audits and new evaluation benchmarks before LLMs are deployed in clinical settings, warning that models trained on sanitized data may falter under real-world patient interaction scenarios.

Large language models used in health care may mistakenly tell patients to stay home instead of seeking treatment based on typos or conversational phrasing in their messages, according to a new study from MIT.

Presented at the ACM Conference on Fairness, Accountability, and Transparency, the research underscores the risks of deploying AI systems trained on sanitized data into messy real-world medical environments.

According to MIT, the research team, led by Marzyeh Ghassemi and Abinitha Gourabathina, found that nonclinical variations—such as extra white space, slang, and missing gender cues—can push models like GPT-4 to recommend that a patient self-manage a condition, even when the correct advice would be to seek medical care.

“These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don’t know,” noted Gourabathina, an EECS graduate student and lead author of the study.

This shift was seen across four models, including the large, commercial model GPT-4 and a smaller LLM designed for medical settings, according to researchers.

To conduct the study, the researchers modified thousands of patient notes to simulate realistic communication patterns, including uncertainty, health-related anxiety, limited language proficiency, and low digital literacy. They preserved all clinical information while tweaking the style and grammar of the notes. The LLMs were then asked to recommend one of three options for each patient: manage the condition at home, visit a clinic, or receive additional medical resources. Their recommendations were compared against those from human clinicians.
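The study's perturbation step can be pictured as a simple text transform. The sketch below is illustrative only, not the researchers' actual code: it applies style-only changes (occasional extra whitespace and a swap to informal phrasing, both hypothetical choices) while leaving the clinical content of the message intact.

```python
import random

def perturb_message(text: str, seed: int = 0) -> str:
    """Apply style-only perturbations to a patient message, preserving
    all clinical content (an illustrative sketch, not the study's code)."""
    rng = random.Random(seed)  # deterministic for reproducibility
    # Occasionally double the space between words (extra white space).
    words = text.split()
    out = []
    for w in words:
        out.append(w)
        if rng.random() < 0.3:
            out.append("")  # empty token doubles the space when joined
    perturbed = " ".join(out)
    # Swap a formal phrase for a colloquial one (hypothetical example).
    perturbed = perturbed.replace("I am experiencing", "i think im having")
    return perturbed

msg = "I am experiencing chest pain and shortness of breath."
print(perturb_message(msg))
```

In the study's setup, both the original and perturbed versions of each note would then be sent to the model with the same triage prompt, and any change in the home-care/clinic/escalation recommendation attributed to the stylistic edit alone.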

According to MIT, results showed a 7 to 9 percent increase in erroneous suggestions for home care when the text was “perturbed.” The most impactful changes came from dramatic language or slang. Across the board, the models deviated in their judgment from doctors, particularly in patient-facing tasks like triage.

Notably, the study found that these inconsistencies disproportionately affect female patients. In test cases with gender-neutral content, models still made more errors for women, often advising against medical visits when they were warranted.

The study also highlights that standard model evaluation benchmarks often miss these subtle but critical errors. While LLMs may appear accurate when tested on medical exam questions, their performance degrades under more ambiguous or colloquial input—the kind routinely seen in real clinical communication. The fragility of these systems is especially pronounced in chatbot-like applications, where AI interacts directly with patients.

“In our follow up work under review, we further find that large language models are fragile to changes that human clinicians are not,” noted Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and senior author of the study. “This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don’t want to optimize a health care system that only works well for patients in specific groups.”

The authors argue for more rigorous audits of AI systems before clinical deployment. They also call for new benchmarks that test models under realistic and diverse input conditions, especially those reflecting vulnerable populations. Future work will explore other biases and extend perturbation strategies to better reflect patient demographics and communication styles.
