Is healthcare doing enough to study GenAI’s administrative applications?
Healthcare is notorious for being among the most risk-averse industries, and with good reason. Life-and-death decisions are the daily reality for healthcare organizations, and the severe consequences of getting something wrong have produced an environment where caution is key when evaluating new technologies or methodologies.
It’s not a bad stance to take, especially as generative AI enters the scene. With ongoing concerns about the accuracy, trustworthiness, and equity of these models in the clinical care environment, it’s no wonder that providers (and patients, not to mention regulators) are not yet comfortable letting these tools play a major role in clinical decision-making.
But healthcare is also an industry of contradictions. In sharp contrast to their reluctance to enlist GenAI for direct clinical assistance, stakeholders are enthusiastically going all-in on AI for administrative use cases, including workflow optimization, revenue cycle management, and patient engagement.
More than 1 in 4 investment dollars is tied to AI these days, and about three-quarters of healthcare organizations are increasing their GenAI budgets, boosting funding by up to 300% above 2023 levels, according to an April survey by John Snow Labs.
Providers have their sights set on high-value administrative use cases, including using GenAI to create or summarize documentation, assist with coding and billing, smooth out preauthorization processes, and strengthen patient relationships.
The general perception is that these are “safe” ways to use GenAI, since they mainly affect the back office and/or what clinicians do in their time away from the bedside. As a result, these applications might not be subject to the same degree of rigorous academic scrutiny as AI-powered clinical decision support tools, according to a new preprint study from a team of researchers based at Stanford University.
In a literature review of studies published between 2022 and 2024, the authors found that only a fraction of articles focused on the administrative applications of large language models (LLMs), the power behind GenAI tools.
Close to 45% of the more than 500 studies focused on using LLMs to assess medical knowledge (AKA figuring out whether ChatGPT could pass the Turing Test and take a medical licensing exam as well as a human). A further 20% revolved around making diagnoses, while a similar number (17%) examined how LLMs could be used to educate patients.
Only the smallest fraction of studies addressed some of the administrative issues that have emerged as major use cases for LLMs during this first wave of implementation, the authors found.
For example, just 0.2% of studies looked at using LLMs to assign provider billing codes, while 0.8% focused on clinical notetaking.
The landscape was similar when the team looked specifically at natural language processing (NLP) and natural language understanding (NLU) tasks. The vast majority of studies (84%) looked at question answering, while only 9% explored summarization tasks and just 3.3% focused on conversational dialogue.
On top of the bias toward studying use cases that have not yet been widely put into practice, AI researchers tend to apply only narrow evaluation criteria to the models in question.
Almost all of the studies – over 95% – used “accuracy” as their primary dimension of evaluation. Less than 20% included fairness/bias as a major criterion for success, and just 4.6% took deployment considerations into account when evaluating their models.
The study indicates that the industry may not be spending enough time holistically evaluating how LLMs are actually being used in healthcare settings, leaving organizations exposed to the risk that their tools are not as well-validated as they would like to believe.
The study of studies raises concerns for organizations that are interested in shielding themselves from liability for using AI tools incorrectly, particularly in light of major lawsuits against a series of insurers for using an algorithm that allegedly wrongfully denied care to Medicare Advantage beneficiaries.
Healthcare leaders need to be aware that while there are varying degrees of risks involved in using AI in the back office, there is no such thing as an AI use case that doesn’t somehow affect patient outcomes. Organizations are responsible for thoroughly vetting their tools across multiple dimensions and pushing developers to adopt and maintain high standards of training, validation, and ongoing monitoring.
Implementers are also responsible for thoroughly understanding the ripple effects of deploying AI in one area of their operations, especially as healthcare workflows become increasingly interconnected across the entire care continuum.
Ultimately, more research is needed to better understand the real-world AI implementation landscape and the most popular use cases for the current generation of GenAI tools. By taking a more detailed look at how AI is really being applied in the healthcare ecosystem, organizations can more confidently test and refine their tools while expanding the definition of truly “safe” applications for these powerful models.
Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system. She can be reached at jennifer@inklesscreative.com.