- A large Gates Foundation-funded clinical trial found that a generative AI support tool used by frontline clinicians in Kenya was safe and improved clinical decision-making, but did not significantly improve short-term patient outcomes.
- The Nature Medicine study involved more than 9,600 patients across 16 primary care clinics and tested AI Consult, a large language model-based tool embedded in an electronic medical record system.
- Researchers found no statistically significant difference in 14-day treatment failure between AI-supported care and standard care, though the AI tool improved documentation and treatment planning and lowered antibiotic-related costs.
A Gates Foundation-funded clinical trial found that a generative AI tool used by frontline clinicians in Kenya was safe and improved clinical decision-making, but did not significantly improve short-term patient outcomes.
According to the researchers from the University of Birmingham, the study, published in Nature Medicine, is one of the first randomized controlled trials to test whether generative AI can improve patient-level outcomes in real clinical care. Most previous tests of health AI tools have focused on simulated cases, technical performance or clinician behavior, rather than whether patients actually fare better.
The university said the trial included more than 9,600 patients across 16 primary care clinics in Kenya. It was led by researchers at the university, supported by the National Institute for Health and Care Research Biomedical Research Centre: Birmingham, sponsored by PATH and conducted with collaborators from the London School of Hygiene and Tropical Medicine and the KEMRI-Wellcome Trust Research Programme in Kenya.
Researchers tested an AI system known as AI Consult. The tool was built into the clinics’ existing electronic medical record system and used a large language model, the same broad class of technology behind generative AI chatbots. In this case, the model was designed to assist clinicians during consultations by offering diagnostic and treatment suggestions based on the information entered into the medical record.
Clinicians were randomly assigned to use the electronic medical record system either with or without the AI tool. During patient visits, AI Consult worked in the background. It analyzed information entered by the clinician, generated diagnostic and treatment suggestions aligned with Kenyan national clinical guidelines and flagged possible concerns through a simple green, yellow or red alert system.
Researchers stressed that clinicians remained in charge of care: they were not required to follow the AI tool’s suggestions, and they retained responsibility for diagnosis, prescribing and referral decisions. Patients did not see the AI interface, a design choice intended to preserve normal patient-clinician interaction.
The Findings
Researchers reported that the main finding was mixed. The AI-supported care did not produce a statistically significant improvement in treatment failure within 14 days. Treatment failure occurred in 2.2% of patients in the AI group and 2.0% of patients receiving standard care.
The study also found no evidence that the tool caused harm. Hospitalization and death rates were similar in both groups.
Researchers said that finding matters because safety is one of the central questions around the use of generative AI in health care. Large language models can produce fluent but incorrect recommendations, and clinical settings leave little room for untested systems. In this trial, the researchers found that the tool could be integrated into daily primary care without undermining patient trust or clinician autonomy.
The study has relevance beyond Kenya, including for high-income health systems, researchers said. Still, they cautioned that generalizability needs to be evaluated.
“What we found is reassuring but also sobering,” noted senior author Professor Bilal Mateen, Honorary Professor of Machine Learning for Health at the University of Birmingham, and Chief AI Officer at PATH. “The technology appears safe and clearly improves aspects of clinical decision-making, but translating those gains into measurable patient benefit is much more challenging, particularly in everyday primary care.”
The trial did show gains in the quality of clinical work. An independent panel of experienced clinicians, blinded to whether AI had been used, found that AI-supported visits had significantly better clinical documentation and treatment planning.
Additionally, patient satisfaction was the same in both groups. The researchers said this suggests the AI system did not noticeably change the patient experience, despite being active during the consultation.
The study also found a difference in antibiotic-related costs. Overall antibiotic prescribing rates were similar between the two groups, but antibiotic costs were lower in the AI-supported group. The researchers attributed that to more cost-conscious prescribing choices.
The Study’s Limitations and Potential Impact
A major limitation is that serious outcomes are rare in primary care. Hospitalization and death happen infrequently in that setting, which makes it difficult for even a large trial to detect modest changes. The researchers said studies involving more than 100,000 patients may be needed to identify smaller effects on serious outcomes.
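To see why rare outcomes demand such large trials, the textbook normal-approximation formula for comparing two proportions can be sketched in a few lines of Python. This is only an illustration and not the researchers' own calculation: the function name `patients_per_arm`, the 5% significance level, the 80% power and the event rates in the example are all assumptions chosen for the sketch, and the formula ignores the fact that this trial randomized clinicians rather than individual patients, which would push the required numbers higher.

```python
import math
from statistics import NormalDist


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients needed in each arm to detect a difference
    between two event rates (two-sided test, normal approximation).

    Hypothetical helper for illustration; it assumes individually
    randomized patients and ignores clustering by clinician or clinic.
    """
    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two arms.
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    # Absolute difference in event rates that the trial should detect.
    effect = p_control - p_treated
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates, not taken from the study: a serious outcome that
# occurs in 1.0% of patients under standard care and 0.8% with support.
n = patients_per_arm(0.010, 0.008)
print(f"Roughly {n:,} patients per arm, {2 * n:,} in total")
```

Under these assumed rates the requirement runs to tens of thousands of patients per arm, and halving the detectable difference roughly quadruples it, which is consistent with the researchers' point that even a trial of more than 9,600 patients can miss modest changes in rare outcomes.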
“A large part of primary care is to deal with common conditions, including those that are self-limiting, where many patients require low levels of healthcare intervention,” added Professor Alastair Denniston, co-author, Professor of Regulatory Science and Innovation at the University of Birmingham and lead for health data research at the NIHR Biomedical Research Centre: Birmingham. “In that context, even meaningful improvements in clinical reasoning may only result in small changes in patient outcomes that are very difficult to measure.”
That point is important for health systems weighing AI investments. A tool may help clinicians make better decisions, document cases more clearly or prescribe more efficiently without quickly producing lower hospitalization or death rates. The researchers said the study suggests that patient outcomes may be the hardest test for generative AI in medicine, not because the tools have no value, but because the effects may be subtle, indirect or dependent on the setting.
“Robust trials like this are so important to establish the real impact of using AI in practice,” said Professor Richard Riley, Professor of Biostatistics at the University of Birmingham and senior author. “They help set realistic expectations of what AI can actually contribute within existing care pathways, and helps guide where future investment and research effort should be focused. Generalisability of our findings to higher-income settings, where baseline standards of care are already high, needs to be evaluated.”