One pill, three chatbots, four answers: Why AI models get simple healthcare questions wrong
The label was correct, but the pills looked different than usual. I was expecting brown and oblong, but these were round and white.
So I did what any enterprising digital health journalist/patient empowerment advocate/neurotic human with health anxiety would do and sent ChatGPT a photo of the pills in my hand.
“What drug is this?” I asked, and the reply came swiftly and confidently.
“Based on the shape of the pill and the imprint of I|G on one side and 208 on the other, this matches Ibuprofen 800 mg, commonly manufactured by InvaGen Pharmaceuticals.”
Fantastic. A clean and simple answer. Except I don’t have a prescription for ibuprofen 800mg. I’ve got a prescription for citalopram 40mg (because…you know…everything). #endmentalhealthstigma
“Wow,” I thought. “ChatGPT might not have saved my life in this instance, but it probably saved my stomach lining. This will make a really catchy LinkedIn post about how AI can help improve medication adherence and avoid adverse events outside of the clinic.”
I was already racking up the likes and shares in my head before I paused for a second.
Hadn’t I just written this article about ChatGPT’s blind spots in diagnostics and triage? And this one about how AI can sometimes mishandle authoritative-sounding health claims? And this one about patients not being great at using AI to make health decisions?
What if ChatGPT was wrong about how it identified the pills? Now that would be an even better tale for LinkedIn.
So, I turned to Google, the venerable gateway to the internet, which now offers extensive AI summaries at the top of the page. I didn’t upload the image of the pills, but I did ask, “What pills have IG on one side and 208 on the other?”
“Citalopram 40mg, of course,” it replied. “No doubt about it.”
The plot thickens.
I went back to ChatGPT. “Google is wrong,” it said, with what I’m sure wasn’t a smug tone at all. “Citalopram tablets in the US do not use the IG / 208 imprint. Google’s AI summary probably mis-associated the imprint without cross-validating with other sources.”
“This is exactly the kind of failure we sometimes see in LLMs,” added the pot about the kettle.
All right. Game on. I had to find out…what does Claude have to say?
After failing to recognize that the front and the back of the pill were two sides of the same medication, Claude told me the pills were probably Atenolol 25mg (a beta blocker), but it couldn’t tell from the picture. I sent it another image, and it decided it could be Ibuprofen, but a 200mg dose instead of 800mg.
Now I’ve got a real quandary on my hands. Three models, four answers about a single pill, and at least one model willing to defend itself by assassinating the character of its competitor. What’s a confused patient to do?
To be fair, all of the models said not to take any weird pills and suggested I consult with my pharmacist as soon as possible, which is exactly what I did.
I was a little worried about doing so. I have a wonderful, independently owned neighborhood pharmacy that’s never done me wrong, and I really didn’t want this to be a story about how a fallible human in a failing system made a dispensing error and we should all start using AI to handle our medications instead.
Happily, that’s not the case. My pharmacist checked the pills, checked the label, listened to an abbreviated version of my story, and nodded. She went into the back and brought out the bottle from which she had dispensed my pills, shook a few out into the lid, and let me take a long, hard look. Citalopram 40mg. Whew.
“You did the right thing,” she assured me. “When in doubt, always go back to the source.”
Fortunately for me, the worst thing that happened is that I missed one dose of a maintenance medication. But I can think of a dozen permutations that could have gone differently, putting me at risk of harm. After all, adverse drug events are among the most common issues in healthcare, causing around 1.5 million emergency department visits and 500,000 hospitalizations each year.
And with millions of people now turning to AI for increasingly complicated health questions every single day, especially as leaders like OpenAI and Anthropic start aggressively pushing their healthcare offerings, it’s becoming very clear that the scaffolding underpinning these AI/consumer relationships is quite fragile indeed.
The underlying problem: Probabilistic vs. deterministic AI models
Skim this part if you don’t want to nerd out, but here’s how and why the errors in identification unfolded.
The fundamental tension stems from the fact that large language models (LLMs) are probabilistic in nature. That means they work on pattern recognition at an incomprehensibly complex scale to identify the most likely (or most likely desired) response to a query. That’s why two users can feed the same prompt to the same model and come out with slightly different answers, even when it seems like there’s only one correct solution (e.g., what a pill really is).
That’s fine when you’re trying to solve a creative problem, such as generating a (more or less) human-sounding blog post or fine-tuning a business plan. Variations in outputs are acceptable and even desired.
But in cases where the answer is deterministic – a pill can only be one thing and should only be identified as such – LLMs can sometimes get the probabilities wrong yet still present their solution with too high a degree of confidence.
That’s exactly what happened with ChatGPT. In its own words, it treated the pill’s physical imprint as a unique identifier that was mapped to Ibuprofen 800mg in all cases at all times, which simply isn’t the case.
“In reality, pill ID is only ‘deterministic’ when you’re using an authoritative, up-to-date database (or the pharmacy’s inventory system) — not when you’re relying on a model’s memory of imprint mappings,” it explained after the fact.
Its repeated confidence in its answer, even after being challenged several times, was due to “pattern overreach,” in which the model locked onto a familiar pattern: the imprint is very often used for Ibuprofen, so it must always be used for Ibuprofen and only Ibuprofen.
It also failed to independently verify the imprint with a trusted external source.
“I treated a visual imprint like it was a guaranteed unique key across all manufacturers and time. It isn’t, at least not in the way I used it here,” it said. “Even when imprints are intended to be identifying, my ability to map them correctly without an external database is not reliable. I didn’t actually verify the imprint against the specific manufacturer you later provided.”
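For the nerds still with me, here is a minimal sketch of what a deterministic lookup could look like, as opposed to a model's memory of imprint mappings. Everything in it is an assumption for illustration: the `PillRecord` type, the `identify` function, and the tiny `REFERENCE` table are hypothetical stand-ins for a verified, up-to-date database. The first row reflects the pill in this story as my pharmacist confirmed it; the "XX 000" rows are invented to show an imprint that is not a unique key on its own.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PillRecord:
    imprint: str
    shape: str
    color: str
    drug: str


# Toy stand-in for an authoritative database (hypothetical data).
# Row 1 mirrors this story; the "XX 000" rows are made up.
REFERENCE: List[PillRecord] = [
    PillRecord("IG 208", "round", "white", "citalopram 40 mg"),
    PillRecord("XX 000", "round", "white", "made-up drug A"),
    PillRecord("XX 000", "oblong", "brown", "made-up drug B"),
]


def normalize_imprint(raw: str) -> str:
    """Reduce 'ig/208', 'IG208', 'IG  208' to the same key: 'IG 208'."""
    return " ".join(re.findall(r"[A-Za-z]+|\d+", raw)).upper()


def identify(
    imprint: str,
    shape: Optional[str] = None,
    color: Optional[str] = None,
    records: List[PillRecord] = REFERENCE,
) -> Optional[str]:
    """Return a drug name only when exactly one drug matches.

    Zero matches or several matches both return None, so the caller
    has to stop and ask a human instead of guessing.
    """
    key = normalize_imprint(imprint)
    drugs = {
        r.drug
        for r in records
        if r.imprint == key
        and (shape is None or r.shape == shape.lower())
        and (color is None or r.color == color.lower())
    }
    return next(iter(drugs)) if len(drugs) == 1 else None


if __name__ == "__main__":
    print(identify("IG / 208", "round", "white"))
    print(identify("XX 000"))            # ambiguous imprint
    print(identify("XX 000", "oblong"))  # extra detail resolves it
```

The design choice that matters is the `None` return: an ambiguous or unknown imprint produces no answer at all, rather than the most familiar one.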
Claude made similar probabilistic vs. deterministic mistakes, but with an added visual component. At first, it failed to recognize that a photo of two pills in my hand, one showing the obverse and one the reverse, actually represented dual views of the same medication. This led to a multimodal binding error, in which it could not correctly align the visual input with the textual data.
It then engaged in probabilistic calculations to identify the pill rather than consulting a deterministic source, the same basic misstep as ChatGPT.
“I’m not connected to a live pill identification database,” Claude acknowledged. “When I identify medications from images or imprint descriptions, I’m doing statistical pattern matching against training data — which may be incomplete, outdated, or mis-weighted. The fact that I (or any LLM) deliver that output in confident declarative sentences is the real design problem. The confidence is stylistic, not epistemic.”
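To make the "statistical pattern matching" that Claude describes a little more concrete, here is a toy sketch of weighted sampling. It is not how any of these products actually work internally. The candidate answers come from this story, but the scores in `CANDIDATE_SCORES` are invented, and the function names are my own assumptions.

```python
import math
import random
from typing import Dict

# Hypothetical scores a model might assign to candidate answers.
# The numbers are invented for illustration only.
CANDIDATE_SCORES: Dict[str, float] = {
    "ibuprofen 800 mg": 2.0,
    "citalopram 40 mg": 1.6,
    "atenolol 25 mg": 0.5,
}


def softmax(scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    scaled = {name: s / temperature for name, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - peak) for name, s in scaled.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def sample_answer(
    scores: Dict[str, float], rng: random.Random, temperature: float = 1.0
) -> str:
    """Draw one answer at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def most_likely(scores: Dict[str, float]) -> str:
    """Return the top-scoring answer, with no randomness involved."""
    return max(scores, key=lambda name: scores[name])


if __name__ == "__main__":
    rng = random.Random()
    # The same "prompt" asked ten times can come back with different answers.
    print([sample_answer(CANDIDATE_SCORES, rng) for _ in range(10)])
    # Removing the randomness makes the answer repeatable, not correct.
    print(most_likely(CANDIDATE_SCORES))
```

Note that nothing in this sketch checks an answer against the real world. In the toy scores, the top-ranked answer is the wrong one for my pill, and it would be delivered in the same declarative sentence either way.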
Perhaps most fascinating is what happened with the Google output.
Even though its answer was correct, both ChatGPT and Claude agreed that there were some fundamental flaws with how it arrived at – and presented – its solution.
ChatGPT explained that Google’s AI search summary acts a little differently by blending straight-up search results with probabilistic AI outputs. But since users can’t see which parts of its answers are drawn directly from web sources and which are generated through AI model inferences, its “confident explanatory tone creates a false sense of authority,” even if it happened to be right in this particular case.
“Being correct does not mean the system behaved correctly. Correct answers from probabilistic systems can still produce unsafe user experiences if the reasoning chain and authority signals are unclear,” ChatGPT cautioned.
And Claude agreed with this assessment. “Google presented a retrieved fact with the same confident tone that I used for a hallucinated one,” it said. “The user has no way to distinguish between ‘Google looked this up in a verified drug database’ and ‘Google synthesized this from pattern matching.’ Both look identical in the interface. That’s a trust calibration problem. A correct answer delivered without transparency about its sourcing doesn’t actually build appropriate trust — it just gets lucky.”
In this case, the Google AI summary can’t defend itself, but the other two models make a fair point: getting the right answer once doesn’t mean the system is doing things right every time. That is why I can’t emphasize enough how important it is to fully understand and apply thorough, transparent, and consistent data governance principles to AI models in the healthcare setting.
AI physician, heal thyself?
Because I love a little bit of irony, I decided to ask ChatGPT and Claude how they would correct their own errors, fully knowing that now I can never really trust another word they say on the subject after being so assertively wrong about everything so far.
But Claude surprised me by coming out with something that feels 100% correct.
“What this case study really illustrates is that right now, AI occupies an uncanny valley in healthcare: capable enough that patients trust it, not reliable enough to deserve that trust for deterministic safety tasks,” it told me. “Closing that gap is a design and governance problem, not just a technical one.”
Then ChatGPT chimed in with something a little less on the mark: a list of actionable recommendations which sound good in theory, but actually require the user to have a fairly sophisticated ability to understand and analyze data, engage in independent research, and quickly access health system resources.
The model suggested more transparency around sources by clearly labeling where answers come from (primary databases, secondary summaries, pattern matching from training data, or general inferences), as well as presenting a “confidence calibration” that quantifies its uncertainty level.
“An answer like ‘this pill matches imprint X’ should come with a probability estimate, evidence links, and contraindications,” it said.
That’s great, but not if the average user can’t understand or apply that information. About 54% of US adults read below a 6th grade level, and about a third of adults performed at the lowest proficiency levels in numeracy and adaptive problem solving in 2023.
The answer can’t just be about presenting more information to contextualize potentially wrong answers. It has to be about getting to the right answer more reliably to begin with – and providing stronger, clearer signals or even full stops when there’s any doubt at all about the output.
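As a rough sketch of how source labeling and a plain-language stop signal might sit together, consider the structure below. The four source categories are the ones ChatGPT listed; the `LabeledAnswer` type, the `render` function, and the wording of the warning are hypothetical, and the confidence number is whatever estimate a system supplies, not something this code can verify.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    PRIMARY_DATABASE = "primary database"
    SECONDARY_SUMMARY = "secondary summary"
    TRAINING_PATTERN = "pattern matching from training data"
    GENERAL_INFERENCE = "general inference"


@dataclass(frozen=True)
class LabeledAnswer:
    text: str
    source: Source
    confidence: float  # hypothetical estimate between 0.0 and 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def render(answer: LabeledAnswer) -> str:
    """Format an answer so its sourcing is visible to the reader."""
    lines = [
        answer.text,
        f"Source: {answer.source.value}",
        f"Estimated confidence: {round(answer.confidence * 100)}%",
    ]
    # Anything not looked up in a primary database gets a short,
    # plain-language warning instead of a wall of caveats.
    if answer.source is not Source.PRIMARY_DATABASE:
        lines.append("Not checked against a drug database. Ask your pharmacist.")
    return "\n".join(lines)


if __name__ == "__main__":
    guess = LabeledAnswer("This may be ibuprofen.", Source.TRAINING_PATTERN, 0.6)
    print(render(guess))
```

The label and the percentage serve the sophisticated user; the one-line warning is for everyone else, which is the part the recommendations left out.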
Sometimes, the only solution is no solution
The hard truth is that probabilistic AI models with chatbot interfaces just might not be suitable for some healthcare tasks. Drug identification, dosage calculations, and medication interactions don’t have quite as much leeway as “should I go to the ED right now for this rash?” Instead, we need to base patient-facing output on structured rules and verified databases in order to get a correct answer every time.
If model developers can build mechanisms to connect their chatbots to these sources and provide a clear and trusted answer, that would be fantastic. But in the absence of these interfaces, even the chatbots themselves agree: the best solution right now is not to give an answer that has any meaningful probability of being incorrect.
Systems that can’t route drug-related queries to structured databases should decline to answer, rather than giving the user a probabilistic answer and expecting them to know how to interpret the caveats.
The model should simply say “talk to your pharmacist before taking any action,” and continue to refer users to human clinical experts when pressed.
This approach might not make corporate shareholders very happy, since popular models are optimized for engagement and there’s a strong financial incentive to avoid any friction in the user experience that might reduce the time users spend in the app.
But in situations where patient safety is at stake, it’s better for developers to take a small, temporary hit to their engagement metrics than to lose a user entirely to medication-related complications.
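The route-or-refuse behavior described in this section could be sketched roughly as follows. This is an assumption-heavy illustration, not a production design: the keyword list in `DRUG_TERMS` is a crude, hypothetical stand-in for a real query classifier, and `lookup` and `chat` are placeholder callables for a verified database and a general-purpose model.

```python
import re
from typing import Callable, Optional

REFERRAL = "Talk to your pharmacist before taking any action."

# Hypothetical keyword list. A real system would need something far
# more robust than this to recognize drug-related questions.
DRUG_TERMS = {
    "pill", "tablet", "capsule", "dose", "dosage",
    "imprint", "medication", "mg",
}


def is_drug_query(question: str) -> bool:
    """Crude check: does any word (ignoring a plural 's') look drug-related?"""
    words = re.findall(r"[a-z]+", question.lower())
    return any(word.rstrip("s") in DRUG_TERMS for word in words)


def respond(
    question: str,
    lookup: Optional[Callable[[str], Optional[str]]],
    chat: Callable[[str], str],
) -> str:
    """Route drug questions to a verified lookup, or refuse to guess."""
    if not is_drug_query(question):
        return chat(question)  # open-ended questions can stay probabilistic
    if lookup is None:
        return REFERRAL        # no structured source is connected
    verified = lookup(question)
    return verified if verified is not None else REFERRAL


if __name__ == "__main__":
    question = "What pills have IG on one side and 208 on the other?"
    print(respond(question, None, lambda q: "a confident guess"))
```

In this sketch the chatbot never gets a turn on a drug question. Pressing it again returns the same referral, because the refusal sits in the routing rather than in the model's judgment.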
Trust as a competitive differentiator for AI models
AI isn’t useful in healthcare because it can provide an answer. It’s only useful when it can provide the right answer in the right context with a high enough degree of certainty that the user can act upon the information and achieve a positive result.
Clearly, commercially available AI tools just aren’t there yet in many areas, and users need to know that.
The good news is that user trust is the number one issue in AI right now, with patients increasingly demanding transparency and accountability from both their providers and the companies making AI products.
That could spark a race to be the most transparent, the most accurate, and the most trustworthy on health issues, preventing future users from facing the same merry-go-round of wrong information that I encountered. Until then, however, the best thing a patient can do is to recognize when a system has reached its limits and instead turn to their dispensing pharmacist for a fully trustworthy and verifiable answer.
Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system. She can be reached at [email protected].