ChatGPT Health downplayed more than half of cases that required emergency care
Researchers at the Icahn School of Medicine at Mount Sinai found that ChatGPT Health does not always align its medical reasoning with the advice it gives patients.
Across test cases, the system correctly identified signs of serious clinical risk in its written analysis but advised patients to wait or seek routine outpatient evaluation. In one scenario, a patient presented with symptoms consistent with early respiratory failure. ChatGPT Health described the danger in its explanation but incorrectly advised the patient to wait.
Published in Nature Medicine, the evaluation offers the first independent look at how OpenAI’s consumer health tool performs in triage situations since its January 2026 launch. The study found that ChatGPT Health under-triaged 52% of cases that physicians agreed required emergency care.
The evaluation tested 60 clinician-authored patient vignettes spanning 21 medical specialties. Researchers ran those cases under 16 contextual conditions, producing 960 total model interactions. Three independent physicians established the appropriate triage level for each scenario using guidelines from 56 medical societies.
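The arithmetic behind the 960 interactions can be checked with a short Python sketch. It assumes every vignette was run under every contextual condition, which is what the totals imply. The identifiers below are hypothetical placeholders, not the study's actual labels.

```python
from itertools import product

N_VIGNETTES = 60   # clinician-authored vignettes, per the article
N_CONDITIONS = 16  # contextual conditions, per the article

# Hypothetical identifiers; the study's real case and condition names
# are not given in the article.
vignettes = [f"vignette_{i:02d}" for i in range(1, N_VIGNETTES + 1)]
conditions = [f"condition_{j:02d}" for j in range(1, N_CONDITIONS + 1)]

# Assumption: a full crossing, one model interaction per (vignette, condition) pair.
interactions = list(product(vignettes, conditions))

print(len(interactions))  # 960
```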
The model’s explanatory text recognized clinically dangerous findings while its final recommendation directed users toward a 24-to-48-hour outpatient evaluation instead of emergency care. The system appeared to understand the risk but did not escalate its advice.
“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” said Ashwin Ramaswamy, MD, an instructor of urology at Mount Sinai and the study’s lead author. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most.”
When symptoms clearly signaled life-threatening emergencies, the model typically directed users to seek emergency care. Lower-risk scenarios produced a different pattern. Among nonurgent presentations, ChatGPT Health over-triaged roughly 65% of cases, recommending physician visits when home care would have been sufficient.
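Under-triage and over-triage rates of this kind can be computed by comparing each model recommendation with the physician reference level. The Python sketch below is illustrative only. The three-level scale, the function names, and the toy data are assumptions made for this example and are not taken from the study.

```python
from enum import IntEnum


class Triage(IntEnum):
    # Hypothetical ordinal scale; higher means more urgent.
    HOME_CARE = 0
    ROUTINE_OUTPATIENT = 1
    EMERGENCY = 2


def classify(reference: Triage, recommended: Triage) -> str:
    """Label one recommendation relative to the physician reference level."""
    if recommended < reference:
        return "under-triage"
    if recommended > reference:
        return "over-triage"
    return "concordant"


def rate(pairs, level: Triage, outcome: str) -> float:
    """Share of cases at a given reference level that have the given outcome."""
    relevant = [(ref, rec) for ref, rec in pairs if ref == level]
    if not relevant:
        return 0.0
    hits = sum(classify(ref, rec) == outcome for ref, rec in relevant)
    return hits / len(relevant)


# Toy data for illustration only, not results from the study.
toy_pairs = [
    (Triage.EMERGENCY, Triage.EMERGENCY),
    (Triage.EMERGENCY, Triage.ROUTINE_OUTPATIENT),
    (Triage.HOME_CARE, Triage.ROUTINE_OUTPATIENT),
    (Triage.HOME_CARE, Triage.HOME_CARE),
]

emergency_under = rate(toy_pairs, Triage.EMERGENCY, "under-triage")
home_over = rate(toy_pairs, Triage.HOME_CARE, "over-triage")
```

In this framing, the denominator is the set of cases at a given reference level, so a 52% under-triage figure would describe the share of physician-rated emergencies that received a less urgent recommendation.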
When reasoning and recommendations diverge
ChatGPT Health is designed to surface a link to the 988 Suicide and Crisis Lifeline in high-risk situations. In the Mount Sinai tests, those alerts appeared inconsistently. They sometimes triggered in lower-risk scenarios while failing to activate when users described specific plans for self-harm.
Girish N. Nadkarni, MD, chief AI officer of the Mount Sinai Health System and the study’s senior author, said the pattern effectively inverted expected clinical risk thresholds. The risk is compounded because patients seeking urgent advice are more likely to take the final directive at face value than to scrutinize the reasoning behind it.
ChatGPT Health was also easily swayed by the opinions of friends and family members. If they minimized the situation, the model was more likely to follow suit.
As use grows, oversight remains unclear
OpenAI reported within weeks of launch that approximately 40 million people were using ChatGPT Health daily, including for triage-related questions. Roughly 70% of those health conversations occur outside normal clinic hours, and more than 580,000 weekly messages originate from communities located more than 30 minutes from a hospital.
An OpenAI spokesperson said the company welcomed the research but argued that the evaluation did not reflect how ChatGPT Health is designed to function. The tool is intended for multi-turn conversations in which users provide additional context through follow-up questions rather than relying on a single prompt for triage guidance.
The Mount Sinai researchers acknowledged the system’s multi-turn design. They also noted that the vignettes used in the study provided standardized and clearly written clinical inputs. If the tool under-triages when given structured information, the authors wrote, performance is unlikely to improve when real patients provide incomplete or contradictory descriptions of symptoms.
The evaluation reflects a single snapshot of model behavior. As the system evolves through updates, the researchers argue, independent monitoring will be necessary to track safety performance over time.
Regulatory oversight remains unsettled. On January 6, 2026, one day before ChatGPT Health launched, the FDA revised guidance covering clinical decision support software and general wellness devices, loosening oversight requirements for certain AI-enabled products. The update did not address consumer-facing AI health chatbots directly.
Tools like ChatGPT Health now provide triage-level guidance to millions of users without a clinician in the loop while remaining outside the FDA’s clarified oversight framework.
The Mount Sinai team published the study’s full dataset, including all 960 prompts and model responses, on Zenodo, providing a detailed public record of the system’s responses to emergency medical scenarios.