Research Explores the Value of Rapid Testing of AI in Healthcare


This article was written by Joshua Biro, PhD and Raj Ratwani, PhD.

 

Two studies explored how rapid evaluation of new technologies can help humans and AI work better together. Studies like these can help speed up the safe adoption of new and innovative technologies by identifying gaps in AI and human performance. 

 

In healthcare, artificial intelligence technologies influence how providers and patients experience care. As we seek to understand how humans and AI can work together, research can help us provide safe, efficient care.


At the National Center for Human Factors in Healthcare, we’re researching ways to quickly evaluate AI technologies to ensure we understand the benefits and risks. We work closely with MedStar Health teams to turn new knowledge into practice, keeping patients safe while improving efficiency for providers.


A 2024 Healthcare Information and Management Systems Society survey found that 86% of more than 800 healthcare providers reported using AI in their organizations. Yet we’re still learning the workflow implications of these new technologies.


Electronic Health Records (EHRs) have made it simple for patients to message their doctors—and patients expect and deserve a timely response. Studies show primary care physicians spend more than half their day on EHR-related messaging and documentation, a factor in about 50% reporting burnout. 


AI technologies can help doctors provide timely, responsive patient care while reducing the administrative overload that leads to burnout. When implementing new solutions, it’s critical to identify any potential risks these new technologies might pose, such as inaccuracy or privacy concerns.


In two recent research studies, we explored how well AI technologies could help providers respond to patient messages and take notes in the clinic. The results allow us to implement new technologies faster while maintaining the highest standards of patient care.


Enabling efficient responses to patient portal messages.

Studies show that primary care physicians get almost 50 electronic messages from patients each day, the most of any specialty surveyed. 


To help doctors respond to these messages, generative AI applications have been trained to draft responses. Yet we know that generative AI can introduce errors or biases, including inaccurate or outdated information, so physicians must review these draft messages for accuracy. 
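To make that workflow concrete, here is a minimal sketch of the human-in-the-loop pattern in Python. The names and structure are our illustration, not the API of any messaging product: the AI output is only ever a draft, and nothing reaches the patient until a physician approves or rewrites it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DraftReply:
    """An AI-generated draft awaiting physician review (illustrative only)."""
    patient_message: str
    ai_draft: str

def finalize_reply(draft: DraftReply, approved: bool,
                   physician_edit: Optional[str] = None) -> Optional[str]:
    # The draft is never sent without explicit physician sign-off.
    if not approved:
        return None  # physician discards the draft and writes a reply manually
    # If the physician edited the text, the edited version is what goes out.
    return physician_edit if physician_edit is not None else draft.ai_draft
```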


Before these technologies are used at MedStar Health, we’re working to understand the risks associated with AI-introduced errors. Our study, published in NPJ Digital Medicine, explored how likely primary care physicians are to catch AI errors, as well as their experiences using AI for patient messaging.


To do this, we presented 20 primary care physicians with 18 AI-generated drafts of responses to patient messages. Four of these drafts contained errors:

  • A typographical error

  • Outdated vaccination advice

  • Misstating the urgency of the risk of a blood clot

  • Failing to address the possibility of diabetic ketoacidosis, a condition that requires urgent intervention

We analyzed the doctors’ editing performance and surveyed them about the experience. 
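To give a flavor of that analysis, here is a minimal sketch of how detection rates could be tallied from coded review data. All of the data below are invented for illustration; in the study, physicians’ actual edits were coded and analyzed.

```python
# The four seeded errors from the study's drafts.
SEEDED_ERRORS = ["typo", "outdated_vaccine_advice", "clot_urgency", "missed_dka"]

# Invented example: which seeded errors each physician corrected.
reviews = {
    "physician_01": {"typo", "clot_urgency"},
    "physician_02": {"typo"},
    "physician_03": set(),
}

for error in SEEDED_ERRORS:
    caught = sum(error in corrected for corrected in reviews.values())
    print(f"{error}: caught by {caught}/{len(reviews)} reviewers "
          f"({caught / len(reviews):.0%})")
```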


Data showed that AI-introduced errors are difficult to catch. Participants missed about two-thirds of the errors; only one participant caught all of them. 


Previous human factors research provides some reasonable explanations for this performance, including:

  • Functional Fixedness Trap: A cognitive bias that limits our ability to consider alternatives when we are handed a ready-made solution. When an AI-provided diagnosis matches what we expect, it is harder to consider less likely options. 

  • Confirmation Bias: The tendency to interpret information in a way that confirms our existing beliefs. Providers may be less critical of AI drafts that agree with their expectations.

  • Automation Complacency: As automated systems become more reliable, humans tend to monitor them less closely, making us less likely to spot AI errors.

  • Automation Bias: The tendency to rely too heavily on automation, which happens more often among people who are overworked and looking for solutions.

To help humans find errors, application developers could highlight critical sections, such as medication names, so doctors know to verify those portions. Thanks to this study, MedStar Health’s research and implementation teams can work together to ensure training materials flag this issue, so that when AI software for drafting patient portal messages is implemented here at MedStar Health, providers know to increase their diligence and safeguard patients from errors.
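As one illustration of that recommendation, the sketch below flags medication names in a draft so the reviewing physician’s attention is drawn to them first. The hard-coded word list and the asterisk markup are placeholders of our own; a real system would draw on a maintained drug vocabulary such as RxNorm and the EHR’s own highlighting.

```python
import re

# Placeholder list for illustration; a production system would query a
# maintained drug vocabulary rather than hard-code names.
MEDICATIONS = ["metformin", "warfarin", "lisinopril", "insulin"]

def highlight_medications(draft: str) -> str:
    """Wrap medication names in markers so reviewers verify them first."""
    pattern = re.compile(r"\b(" + "|".join(MEDICATIONS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"**{m.group(0)}**", draft)

print(highlight_medications("Please continue Metformin 500 mg twice daily."))
# Please continue **Metformin** 500 mg twice daily.
```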


Related reading: How Health Systems and Policymakers Can Prioritize Patient Safety When Integrating AI


Sim testing for quicker analysis of ambient digital scribes.

Ambient digital scribes are another AI tool in wide use in healthcare today. These tools can “listen” to conversations between the patient and provider, then produce notes of the visit for the provider to review and include in the patient’s EHR. 


Health systems across the nation, including MedStar Health, are adopting ambient digital scribes because they can:

  • Improve the quality of care and patient experience, allowing physicians to focus on patients instead of documentation during visits

  • Streamline documentation to help reduce burnout among overworked clinicians

However, AI note-taking technologies can produce inaccurate or incomplete transcriptions. Our study used simulation (sim) testing to assess the risks and benefits of ambient digital scribes in a single day—real-world pilot testing can take months.


Our sim test case study, published in JAMIA, the Journal of the American Medical Informatics Association, included two teams of two people each: a physician and a researcher playing the patient. The teams read 11 scripts of real outpatient conversations while two ambient digital scribes took notes, producing 44 draft notes that the physicians then edited. Researchers analyzed the physicians’ perceived effort, time to edit, and success at identifying errors. 


Our results showed that it took physicians almost four and a half minutes to correct each note, and they rated the editing difficulty as “moderate.” We found 127 errors, spread across 31 of the 44 draft notes, or 2.88 errors per note. 
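For readers who want to check the arithmetic, the summary figures follow directly from the totals:

```python
total_notes = 44
notes_with_errors = 31
total_errors = 127

errors_per_note = total_errors / total_notes        # 2.886..., reported as 2.88
share_with_errors = notes_with_errors / total_notes  # about 0.70

print(f"{errors_per_note:.3f} errors per note")
print(f"{share_with_errors:.0%} of notes contained at least one error")
```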


The most frequent type of error was omission, meaning that something discussed was not included in the note. This happened more often than “hallucination,” a known risk of generative AI in which fictitious information is included. The difference is notable: Missing content can be more difficult to spot than incorrect content.
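The toy example below shows why omissions are the harder class of error: an incorrect sentence is at least on the page for a reviewer to question, while an omitted one leaves nothing behind to notice. The checklist approach and all of the strings are our illustration, not a method used in the study.

```python
# Facts established during the scripted visit (illustrative strings).
visit_facts = [
    "patient reports chest pain on exertion",
    "started lisinopril 10 mg daily",
    "follow-up in two weeks",
]

# A draft note that silently dropped the follow-up plan.
ai_note = ("Patient reports chest pain on exertion. "
           "Started lisinopril 10 mg daily.")

# An omission produces no visible error in the note itself; the reviewer
# has to remember to look for what is absent.
omissions = [fact for fact in visit_facts if fact not in ai_note.lower()]
print("Omitted from note:", omissions)
# Omitted from note: ['follow-up in two weeks']
```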


This simulation testing case study shows that we can quickly learn a lot about AI products, knowledge that helps us make good decisions about how they’re used today. Follow-up pilot studies will be crucial to safe implementation.


Related reading: How Research Can Drive Improved Healthcare Safety and Equity When Using AI.


New technologies to benefit patients and providers.

The development cycle for AI is very fast, and our research needs to be responsive. Our tight-knit innovation, adoption, research, and operationalization cycle helps MedStar’s clinicians stay at the forefront. 


When care teams are aware of a technology’s potential shortcomings, they can monitor its output more carefully to ensure patient safety, and more providers can feel comfortable adopting well-tested tools to improve efficiency and patient communication. 


Critically, we can help the providers who use these applications be on the lookout for opportunities to increase accuracy and support safety.


Want more information about the MedStar Health Research Institute?

Discover how we’re innovating for tomorrow.
