MIT Study Shows LLMs Factor in Unrelated Information When Recommending Medical Treatments

Insider Brief

  • A new MIT study, presented at the ACM Conference on Fairness, Accountability, and Transparency, finds that large language models used in health care can make flawed treatment recommendations when exposed to nonclinical text variations such as typos, informal language, and missing gender cues.
  • The study tested four models, including GPT-4, using altered patient messages that preserved clinical content but mimicked realistic communication styles, revealing a 7–9% increase in erroneous self-care advice, particularly for female patients.
  • Researchers call for stricter audits and new evaluation benchmarks before LLMs are deployed in clinical settings, warning that models trained on sanitized data may falter under real-world patient interaction scenarios.

Large language models used in health care may mistakenly tell patients to stay home instead of seeking treatment because of typos or conversational phrasing in their messages, according to a new study from MIT.

Presented at the ACM Conference on Fairness, Accountability, and Transparency, the research underscores the risks of deploying AI systems trained on sanitized data into messy real-world medical environments.

According to MIT, the research team, led by Marzyeh Ghassemi and Abinitha Gourabathina, found that nonclinical variations—such as extra white space, slang, and missing gender cues—can push models like GPT-4 to recommend that a patient self-manage a condition, even when the correct advice would be to seek medical care.

“These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don’t know,” noted Gourabathina, an EECS graduate student and lead author of the study.

This shift was seen across four models, including the large, commercial model GPT-4 and a smaller LLM designed for medical settings, according to researchers.

To conduct the study, the researchers modified thousands of patient notes to simulate realistic communication patterns, including uncertainty, health-related anxiety, limited language proficiency, and low digital literacy. They preserved all clinical information while tweaking the style and grammar of the notes. The LLMs were then asked three questions about each note: whether the patient should manage the condition at home, whether they should come in for a clinic visit, and whether additional medical resources should be allocated. Their recommendations were compared against those from human clinicians.
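The article does not include the team's code or prompts, but the workflow it describes, perturbing the style of a note while preserving its clinical content and then comparing the model's triage call against a clinician label, can be sketched roughly as follows. Everything here (the perturbation functions, the `query_model` stub, the three option labels) is a hypothetical illustration, not the study's actual implementation.

```python
import random
import re

# Hypothetical sketch of a perturb-and-compare pipeline; the study's real
# perturbations, prompts, and models are not reproduced here.

SLANG = {"very bad": "super rough", "cannot": "can't"}

def add_extra_whitespace(note: str) -> str:
    """Insert stray spaces, mimicking low digital literacy."""
    return "  ".join(note.split(" "))

def inject_slang(note: str) -> str:
    """Swap in informal phrasing while keeping clinical facts intact."""
    for formal, informal in SLANG.items():
        note = note.replace(formal, informal)
    return note

def strip_gender_cues(note: str) -> str:
    """Replace gendered pronouns with neutral ones."""
    return re.sub(r"\b(she|he)\b", "they", note, flags=re.IGNORECASE)

PERTURBATIONS = [add_extra_whitespace, inject_slang, strip_gender_cues]

def query_model(note: str) -> str:
    """Placeholder for an LLM call; returns one of the three triage options.
    A real pipeline would prompt GPT-4 or a medical LLM here."""
    return random.choice(["home", "clinic", "resources"])

def triage_shift(original_note: str, clinician_label: str) -> dict:
    """Compare the model's answer on the original vs. the perturbed note."""
    baseline = query_model(original_note)
    perturbed_note = original_note
    for perturb in PERTURBATIONS:
        perturbed_note = perturb(perturbed_note)
    perturbed = query_model(perturbed_note)
    return {
        "clinician": clinician_label,
        "baseline": baseline,
        "perturbed": perturbed,
        "flipped_to_home": perturbed == "home" and clinician_label != "home",
    }

if __name__ == "__main__":
    note = "Patient says she cannot sleep and the pain is very bad at night."
    print(triage_shift(note, clinician_label="clinic"))
```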

According to MIT, results showed a 7 to 9 percent increase in erroneous suggestions for home care when the text was “perturbed.” The most consequential changes came from dramatic language and slang. Across the board, the models’ judgments deviated from those of physicians, particularly in patient-facing tasks like triage.
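The article does not spell out how that 7 to 9 percent figure is computed, but the underlying comparison is simple to express: measure how often the model recommends self-care when clinicians would not, before and after perturbation, and take the difference. The snippet below uses invented toy labels purely to show the bookkeeping; the numbers are not the study's data.

```python
# Toy illustration with made-up labels; not the study's data or exact metric.
def erroneous_home_rate(model_calls, clinician_calls):
    """Fraction of cases where the model recommends self-care ('home')
    but the clinician does not."""
    errors = sum(
        1 for m, c in zip(model_calls, clinician_calls) if m == "home" and c != "home"
    )
    return errors / len(model_calls)

clinicians = ["clinic", "home", "resources", "clinic", "clinic"]
original   = ["clinic", "home", "resources", "clinic", "home"]    # 1 of 5 erroneous
perturbed  = ["home",   "home", "home",      "clinic", "home"]    # 3 of 5 erroneous

delta = erroneous_home_rate(perturbed, clinicians) - erroneous_home_rate(original, clinicians)
print(f"Increase in erroneous home-care advice: {delta:.0%}")
```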

Notably, the study found that these inconsistencies disproportionately affect female patients. In test cases with gender-neutral content, models still made more errors for women, often advising against medical visits when they were warranted.

The study also highlights that standard model evaluation benchmarks often miss these subtle but critical errors. While LLMs may appear accurate when tested on medical exam questions, their performance degrades under more ambiguous or colloquial input—the kind routinely seen in real clinical communication. The fragility of these systems is especially pronounced in chatbot-like applications, where AI interacts directly with patients.

“In our follow up work under review, we further find that large language models are fragile to changes that human clinicians are not,” noted Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and senior author of the study. “This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don’t want to optimize a health care system that only works well for patients in specific groups.”

The authors argue for more rigorous audits of AI systems before clinical deployment. They also call for new benchmarks that test models under realistic and diverse input conditions, especially those reflecting vulnerable populations. Future work will explore other biases and extend perturbation strategies to better reflect patient demographics and communication styles.

Greg Bock

Greg Bock is an award-winning investigative journalist with more than 25 years of experience in print, digital, and broadcast news. His reporting has spanned crime, politics, business, and technology, earning multiple Keystone Awards and Pennsylvania Association of Broadcasters honors. Through the Associated Press and Nexstar Media Group, his coverage has reached audiences across the United States.
