Insider Brief
- A new study funded by Mount Sinai reveals that advanced AI models can make basic errors in complex medical ethics scenarios, raising concerns about the reliability of large language models (LLMs) in clinical decision-making.
- Published in npj Digital Medicine and conducted with Rabin Medical Center, the research found that AI models often default to familiar answers, even when those answers are incorrect, underscoring the need for human oversight in high-stakes healthcare applications.
- The Mount Sinai-led team plans to expand testing across broader clinical scenarios and establish an “AI assurance lab” to evaluate AI performance on real-world medical complexities.
A new study funded by the Icahn School of Medicine at Mount Sinai highlights potential risks in deploying artificial intelligence in health care, showing that advanced AI systems can make basic mistakes when navigating complex medical ethics scenarios.
“AI can be very powerful and efficient, but our study showed that it may default to the most familiar or intuitive answer, even when that response overlooks critical details,” noted co-senior author Eyal Klang, MD, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “In everyday situations, that kind of thinking might go unnoticed. But in health care, where decisions often carry serious ethical and clinical implications, missing those nuances can have real consequences for patients.”
Published in npj Digital Medicine, the research was conducted in collaboration with Rabin Medical Center in Israel and other partners.
According to the researchers, the findings raise important concerns about the reliability of even the most advanced large language models (LLMs) in clinical settings, particularly when ethical sensitivity and nuanced reasoning are required. Researchers designed experiments to test how AI models handle modified versions of well-known ethical dilemmas. Their methods drew from the behavioral theories in Daniel Kahneman’s book “Thinking, Fast and Slow,” exploring whether AI tools default to fast, familiar responses or engage in slower, more analytical thinking.
The research team tested several commercially available LLMs using altered lateral thinking puzzles and reworked medical ethics scenarios. In one test, a classic case involving a boy and a surgeon was modified to explicitly remove its ambiguity. Even so, several AI models offered incorrect answers, suggesting they relied on familiar patterns rather than the updated information in the prompt. In another case, AI models responded incorrectly to a scenario about parental consent for a blood transfusion, despite the question clearly stating that consent had already been given.
The study’s implications center on the role of human oversight. While AI tools promise to enhance productivity in medical decision-making, the findings from Mount Sinai and Rabin Medical Center indicate that physicians should exercise caution, especially in ethically sensitive or high-stakes cases. The researchers emphasized that AI should serve as a complement to clinical expertise, not a replacement.
“Simple tweaks to familiar cases exposed blind spots that clinicians can’t afford,” lead author Shelly Soffer, MD, a Fellow at the Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, pointed out. “It underscores why human oversight must stay central when we deploy AI in patient care.”
Looking ahead, the Mount Sinai-led team plans to broaden its evaluation to include a wider range of clinical scenarios. They also aim to establish an “AI assurance lab” to systematically assess how different AI models respond to real-world complexities in health care decision-making.
Funding from Mount Sinai, Rabin Medical Center, and partner institutions supported this research, with findings authored by Shelly Soffer, MD; Vera Sorin, MD; Girish N. Nadkarni, MD, MPH; and Eyal Klang, MD.