Medical data used to train AI is often “anonymized” to protect private health information, a practice crucial for HIPAA compliance. To address the limitations of traditional anonymization methods that may compromise accuracy and realism, Johns Hopkins researchers have developed DREAM, an AI system capable of generating synthetic patient portal messages that can be used to train large language models. DREAM is publicly available on GitHub.
Their approach, described in the Journal of Biomedical Informatics, addresses a critical challenge in medical AI: protecting the privacy of patient records while preserving the accuracy and realism of the data used to train models.
“High-quality synthetic medical data can significantly advance health research and improve patient care,” says the study’s senior author Casey Overby Taylor, an associate professor of biomedical engineering and medicine affiliated with the Malone Center. “By using large language models to create realistic datasets, we can develop potentially useful and meaningful AI models without the patient privacy concerns that come from using real data.”
Taylor says the study is one of the first to use large language models, or LLMs, to generate realistic patient portal communications. Collaborators on the study include Ayah Zirikly, a Malone affiliate and assistant research scientist in the Whiting School of Engineering’s Center for Language and Speech Processing, and Natalie Wang, a PhD student in computer science.
Using OpenAI’s GPT-4, the researchers generated synthetic messages from patients about symptoms and medications, with a focus on creating prompting techniques to produce the most natural, humanlike messages.
“Prompt engineering is the process of feeding the model explicit instructions, or text inputs, for doing exactly what you want the AI to do—not unlike giving someone a detailed recipe for making a cake,” says Wang. “In this case, we based our prompts on a carefully classified set of real-world messages sent from patients to their providers.”
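To make the idea concrete, here is a minimal sketch of what attribute-driven prompt engineering for synthetic portal messages can look like, assuming the OpenAI Python client; the prompt template and the symptom, medication, and urgency attributes are illustrative placeholders, not the prompts or categories used in the study.

```python
# Minimal sketch of attribute-driven prompt engineering for synthetic patient
# portal messages. The template and attributes below are illustrative
# placeholders, not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(symptom: str, medication: str, urgency: str) -> str:
    """Compose an explicit instruction describing the message to generate."""
    return (
        "Write a short patient portal message to a healthcare provider. "
        f"The patient is experiencing {symptom} after starting {medication}. "
        f"The tone should convey {urgency} urgency. "
        "Use plain, natural language, as a real patient would."
    )


def generate_message(symptom: str, medication: str, urgency: str) -> str:
    """Ask the model for one synthetic message matching the given attributes."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(symptom, medication, urgency)}],
        temperature=0.9,  # more varied, humanlike wording
    )
    return response.choices[0].message.content


# Example: one synthetic message about a side effect of a common medication
print(generate_message("persistent nausea", "metformin", "high"))
```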
For the study, the team examined how various prompts influenced 450 patient messages generated by AI. Using a tool called Linguistic Inquiry and Word Count—which analyzes text for various psychological and emotional content—the researchers assigned each message a score based on politeness and sentiment. They also evaluated how accurately symptoms and related medications were mentioned together across the different prompt options.
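The study's scoring relies on LIWC, which is commercial software with its own dictionaries, so the sketch below substitutes tiny hand-rolled word lists to illustrate the two checks described above: counting category words per message and verifying that the prompted symptom-medication pair actually appears in the generated text.

```python
# Illustrative stand-in for the kind of scoring described above. LIWC has its
# own proprietary dictionaries; the tiny word lists here only demonstrate the
# idea of counting category words per message and checking whether the
# prompted symptom-medication pair was actually mentioned together.
import re

POLITE_WORDS = {"please", "thank", "thanks", "appreciate", "sorry"}
NEED_WORDS = {"need", "urgent", "asap", "immediately", "worried"}


def category_score(message: str, lexicon: set[str]) -> float:
    """Fraction of tokens in the message that fall in the given word list."""
    tokens = re.findall(r"[a-z']+", message.lower())
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def pair_mentioned(message: str, symptom: str, medication: str) -> bool:
    """Did the generated message mention both the symptom and the medication?"""
    text = message.lower()
    return symptom.lower() in text and medication.lower() in text


msg = ("Hi, thank you for seeing me last week. I still have persistent nausea "
       "since starting metformin and I really need advice as soon as possible.")
print("politeness:", round(category_score(msg, POLITE_WORDS), 3))
print("need:", round(category_score(msg, NEED_WORDS), 3))
print("pair mentioned:", pair_mentioned(msg, "persistent nausea", "metformin"))
```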
They found that DREAM-generated messages closely matched human-written patient messages in quality and tone. For example, messages generated from “high urgency” prompts had the highest “need” sentiment scores, suggesting a realistically heightened sense of urgency.
However, Taylor, Wang, and Zirikly note that their study also highlights the problem of racial bias in synthetic data generation, finding that AI-generated messages still perpetuate harmful biases. When they prompted the model to generate messages as if written by individuals of particular races, the resulting messages often received lower politeness scores. Additionally, the accuracy of symptom-medication pairings was lower for prompts including the words “Black or African American” and highest for prompts including “white.” The researchers say that careful prompting of generative AI is crucial for minimizing such biases.
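As a rough illustration of that kind of comparison, the hypothetical sketch below groups scored messages by the demographic attribute mentioned in the generation prompt and averages politeness and pairing accuracy per group; the records are made-up placeholders, not figures from the study.

```python
# Hypothetical sketch of a per-group comparison: mean politeness score and
# symptom-medication pairing accuracy, broken down by the demographic
# attribute included in the generation prompt. The values are placeholders.
from collections import defaultdict
from statistics import mean

records = [
    {"group": "white", "politeness": 0.031, "pair_correct": True},
    {"group": "white", "politeness": 0.027, "pair_correct": True},
    {"group": "Black or African American", "politeness": 0.024, "pair_correct": False},
    {"group": "Black or African American", "politeness": 0.029, "pair_correct": True},
]

by_group = defaultdict(list)
for record in records:
    by_group[record["group"]].append(record)

for group, rows in by_group.items():
    print(
        group,
        "| mean politeness:", round(mean(r["politeness"] for r in rows), 3),
        "| pairing accuracy:", round(mean(r["pair_correct"] for r in rows), 2),
    )
```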
Overall, the team believes that synthetic data produced by LLMs will eventually speed up medical research and discovery by allowing researchers to experiment with AI applications using high-quality data while still adhering to privacy rules.
“Synthetic datasets may help to build more robust tools to direct patient portal messages to different provider types or administrative staff, depending on the presence, urgency, and severity of side effects mentioned in those messages,” says Taylor. “Given our interest in genomic medicine, we are next exploring ways to detect messages appropriate to direct for review by a pharmacogenomics expert.”
Additional authors of the study include Yuzhi Lu, Sukrit Treewaree, Michelle Nguyen, Bhavik Agarwal, Jash Shah, and James Stevenson.