MIT Study Shows LLMs Factor in Unrelated Information When Recommending Medical Treatments

Insider Brief

  • A new MIT study, presented at the ACM Conference on Fairness, Accountability, and Transparency, finds that large language models used in health care can make flawed treatment recommendations when exposed to nonclinical text variations such as typos, informal language, and missing gender cues.
  • The study tested four models, including GPT-4, using altered patient messages that preserved clinical content but mimicked realistic communication styles, revealing a 7–9% increase in erroneous self-care advice, particularly for female patients.
  • Researchers call for stricter audits and new evaluation benchmarks before LLMs are deployed in clinical settings, warning that models trained on sanitized data may falter under real-world patient interaction scenarios.

Large language models used in health care may mistakenly tell patients to stay home instead of seeking treatment based on typos or conversational phrasing in their messages, according to a new study from MIT.

Presented at the ACM Conference on Fairness, Accountability, and Transparency, the research underscores the risks of deploying AI systems trained on sanitized data into messy real-world medical environments.

According to MIT, the research team, led by Marzyeh Ghassemi and Abinitha Gourabathina, found that nonclinical variations—such as extra white space, slang, and missing gender cues—can push models like GPT-4 to recommend that a patient self-manage a condition, even when the correct advice would be to seek medical care.

“These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don’t know,” noted Gourabathina, an EECS graduate student and lead author of the study.

This shift was seen across four models, including the large, commercial model GPT-4 and a smaller LLM designed for medical settings, according to researchers.

To conduct the study, the researchers modified thousands of patient notes to simulate realistic communication patterns, including uncertainty, health-related anxiety, limited language proficiency, and low digital literacy. They preserved all clinical information while tweaking the style and grammar of the notes. The LLMs were then asked to recommend one of three options for each patient: manage the condition at home, visit a clinic, or receive additional medical resources. Their recommendations were compared against those from human clinicians.
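The study's perturbation step can be pictured as a simple text transform. The sketch below is illustrative only, not the researchers' actual code: it applies style-only changes (occasional extra whitespace and a swap to informal phrasing, both hypothetical choices) while leaving the clinical content of the message intact.

```python
import random

def perturb_message(text: str, seed: int = 0) -> str:
    """Apply style-only perturbations to a patient message, preserving
    all clinical content (an illustrative sketch, not the study's code)."""
    rng = random.Random(seed)  # deterministic for reproducibility
    # Occasionally double the space between words (extra white space).
    words = text.split()
    out = []
    for w in words:
        out.append(w)
        if rng.random() < 0.3:
            out.append("")  # empty token doubles the space when joined
    perturbed = " ".join(out)
    # Swap a formal phrase for a colloquial one (hypothetical example).
    perturbed = perturbed.replace("I am experiencing", "i think im having")
    return perturbed

msg = "I am experiencing chest pain and shortness of breath."
print(perturb_message(msg))
```

In the study's setup, both the original and perturbed versions of each note would then be sent to the model with the same triage prompt, and any change in the home-care/clinic/escalation recommendation attributed to the stylistic edit alone.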

According to MIT, results showed a 7 to 9 percent increase in erroneous suggestions for home care when the text was “perturbed.” The most impactful changes came from dramatic language or slang. Across the board, the models deviated in their judgment from doctors, particularly in patient-facing tasks like triage.

Notably, the study found that these inconsistencies disproportionately affect female patients. In test cases with gender-neutral content, models still made more errors for women, often advising against medical visits when they were warranted.

The study also highlights that standard model evaluation benchmarks often miss these subtle but critical errors. While LLMs may appear accurate when tested on medical exam questions, their performance degrades under more ambiguous or colloquial input—the kind routinely seen in real clinical communication. The fragility of these systems is especially pronounced in chatbot-like applications, where AI interacts directly with patients.

“In our follow up work under review, we further find that large language models are fragile to changes that human clinicians are not,” noted Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and senior author of the study. “This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don’t want to optimize a health care system that only works well for patients in specific groups.”

The authors argue for more rigorous audits of AI systems before clinical deployment. They also call for new benchmarks that test models under realistic and diverse input conditions, especially those reflecting vulnerable populations. Future work will explore other biases and extend perturbation strategies to better reflect patient demographics and communication styles.
