Can synthetic patient data improve AI?

Artificial intelligence algorithms require more training data than traditional EHRs can provide, and synthetic patient data may be the solution.

By admin

Oct 26, 2022, 3:46 PM

Artificial intelligence (AI) holds incredible promise for the healthcare industry, from enhancing clinical decision support and predicting risks to fostering more efficient operations and streamlining the patient experience.

However, algorithms are only as good as the data they are trained on. Most models require huge volumes of training data, carefully curated at first and then increasingly representative of real-world situations, to learn how to perform their specific tasks.

In the healthcare industry, this data has been very hard to find. Stringent privacy and security protocols, coupled with the fundamental challenges of EHR data quality and interoperability, have made it difficult for AI developers to get access to the large volumes of data they need to train, validate, and optimize their algorithms.

But there may be a viable solution on the horizon: synthetic patients.

These fictional yet realistic records comprise all the necessary data elements for algorithms to work with, such as demographics, healthcare encounters, diseases, allergies, and medications. Yet since they do not represent real, living people, they are free from privacy constraints and can be produced in a structured, interoperable manner at scale.

This strategy could accelerate the development of the learning health system (LHS), asserts a team of researchers from the Learning Health Community (LHC) Learning Health System Technology Forum in an article published in Nature – Scientific Reports.

The team initiated a study to test the feasibility of developing a machine-learning-enabled LHS using synthetic patient data. The goal was to create a risk prediction algorithm for lung cancer.

To do so, the researchers used synthetic data for 150,000 patients created via Synthea, a FHIR-enabled project from the non-profit MITRE Corporation.

“Over 175 million points of data were available from over 13 million encounters for these Synthea patients, including 8 million diagnoses, 111 million observations, 24 million procedures and 15 million medications,” the team explained.
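To give a sense of how such counts can be tallied, here is a minimal sketch in Python. It assumes records arrive as FHIR-style JSON bundles in which each entry carries a `resourceType` field. The bundle below is a hypothetical, hand-written example, not actual Synthea output, and `count_resource_types` is an illustrative helper rather than anything from the study.

```python
from collections import Counter

# Hypothetical, hand-written FHIR-style bundle for one fictional patient.
# This is NOT real Synthea output; it only mimics the general bundle shape.
bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Condition", "id": "c1"}},
        {"resource": {"resourceType": "Observation", "id": "o1"}},
        {"resource": {"resourceType": "Observation", "id": "o2"}},
        {"resource": {"resourceType": "MedicationRequest", "id": "m1"}},
    ],
}


def count_resource_types(bundle):
    """Count how many resources of each type appear in a bundle."""
    return Counter(
        entry["resource"]["resourceType"] for entry in bundle.get("entry", [])
    )


counts = count_resource_types(bundle)
for resource_type, n in sorted(counts.items()):
    print(f"{resource_type}: {n}")
```

Summing these per-patient counters across a full cohort would yield cohort-level totals of the kind the researchers quote.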

The tool’s predictive performance improved as the researchers trained and refined the algorithm, first using a subset of 30,000 synthetic patients and then employing the full cohort of 150,000 individuals. “Recall increased from 0.849 to 0.936, the precision from 0.944 to 0.962, the AUC from 0.913 to 0.962, and the accuracy from 0.938 to 0.975,” the authors revealed.
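For readers less familiar with these metrics, the sketch below shows how recall, precision, accuracy, and AUC are conventionally computed for a binary risk model. The labels and scores are made-up toy values, not data from the study, and the function names and the 0.5 decision threshold are assumptions chosen for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return recall, precision, and accuracy for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return recall, precision, accuracy


def auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 patients with the condition, 5 without (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

recall, precision, accuracy = classification_metrics(y_true, y_pred)
model_auc = auc(y_true, scores)
print(f"recall={recall:.3f} precision={precision:.3f} "
      f"accuracy={accuracy:.3f} auc={model_auc:.3f}")
```

Recall measures how many true cases the model catches, precision measures how many flagged cases are real, and AUC summarizes ranking quality across all possible thresholds, which is why the authors report all of them together.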

They then tested the model by switching its focus to stroke, which occurs more often than lung cancer. The algorithm performed even better on stroke than on lung cancer, validating its capabilities. In addition, the tool performed at levels similar to those of previously published stroke risk algorithms built on traditional patient data.

Caveat: Synthetic patient data just a proving ground

It’s important to note that the authors are not suggesting that algorithms trained exclusively on synthetic patient data should be used for real-world patient care. There are still notable differences between real-world patients and the fictional records generated by programs like Synthea. For now, those differences limit the use of machine learning models trained exclusively on synthetic information.

Rather, synthetic patient data provides an opportunity to test and optimize AI tools, and to guide the development of algorithms that use real EHR data as their primary fuel.

“For example, our collaborators in hospitals are utilizing the synthetic data and [machine learning] code available from this study to develop risk prediction models for lung cancer, nasopharyngeal cancer, transient ischemic attack, and stroke using EHR-wide data of about 1 million real patients,” the authors said.

With this hybrid approach to real and synthetic patient data, health systems may be able to improve their predictive capabilities, reduce health disparities, expand the delivery of evidence-based medicine, and avoid unnecessary costs.

“We hope that once hospitals see the transformative benefits of the learning health system approach…they will implement LHS with real patient data to solve specific clinical delivery problems more effectively,” the article concludes.

Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system.