As the integration of artificial intelligence into medicine advances, there is also a growing interest in using AI models to interpret complex medical information, a step beyond AI’s traditional medical uses.
Today, AI in medicine is used mainly for task automation and pattern recognition, in applications such as chatbots that answer patient queries, algorithms that predict disease, synthetic data generation for privacy, and educational tools for medical students.
But despite these strides, interpreting medical information demands a deeper understanding of complex medical concepts and the ability to distinguish between them, and errors can have life-or-death consequences.
A study recently published in the peer-reviewed journal Computers in Biology and Medicine by researchers at Ben-Gurion University of the Negev sheds new light on the performance of AI models in deciphering medical data, revealing both their potential and significant limitations.
Doctoral student Ofir Ben Shoham and Dr. Nadav Rappaport from the university’s Department of Software and Information Systems Engineering conducted the study to evaluate how effectively AI models comprehend medical concepts. They developed a dedicated evaluation tool called “MedConceptsQA,” which includes over 800,000 questions spanning various levels of complexity. This tool was designed to assess the models’ ability to interpret medical codes and concepts, such as diagnoses, procedures, and medications.
Questions in MedConceptsQA were categorized into three levels of difficulty: easy, requiring basic medical knowledge; medium, demanding a moderate understanding of medical concepts; and difficult, testing the ability to discern nuanced differences between closely related medical terms.
The results were surprising. Most AI models, including those specifically trained on medical datasets, performed poorly, often no better than random guessing. However, some general-purpose models, such as ChatGPT-4, outperformed others, achieving an accuracy rate of approximately 60%. While better than random, this performance still falls short of the precision required for critical medical decisions.
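To make the comparison with random guessing concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark can be weighed against chance. It is an illustration only, not the researchers' evaluation code: the four-option format, the function names, and the toy answers are all assumptions, since the article does not specify how the questions are structured.

```python
def accuracy(predictions, answers):
    """Return the fraction of questions where the predicted option is correct."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


# Hypothetical toy data: five questions, each assumed to have four options (A-D).
NUM_OPTIONS = 4  # assumption, not stated in the article
answers = ["A", "C", "B", "D", "A"]
predictions = ["A", "C", "D", "D", "B"]

model_acc = accuracy(predictions, answers)
baseline = chance_accuracy(NUM_OPTIONS)

print(f"model accuracy:  {model_acc:.2f}")
print(f"chance baseline: {baseline:.2f}")
print(f"margin over chance: {model_acc - baseline:+.2f}")
```

Under this assumed format, a model scoring near 25% would be indistinguishable from guessing, which is why a score of roughly 60% counts as better than random yet still leaves many questions answered incorrectly.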
“It often seems that models specifically trained for medical needs achieve accuracy levels close to random guessing. Even specialized training on medical data does not necessarily translate to superior performance in interpreting medical codes,” said Rappaport.
Interestingly, general-purpose AI models like ChatGPT-4 and Llama 3-70B outperformed specialized clinical models, such as OpenBioLLM-70B, by 9–11%. This highlighted both the limitations of current clinical models and the adaptability of general-purpose models, despite their lack of a medical focus, the researchers said.
The study shows that AI models need more specialized training on diverse, high-quality clinical data to better understand medical codes and concepts. This could lead to the development of more effective AI tools. With further advancements, AI could help in triaging patients, recommending treatments based on medical history, or flagging potential errors in diagnoses.
The findings also suggest that AI models need better training to handle the complexity of medical coding, which could streamline administrative tasks and improve efficiency in healthcare systems.
“Our benchmark serves as a valuable resource for evaluating large language models’ abilities to interpret medical codes and distinguish between concepts,” explained Ben Shoham. “It allows us to test new models as they are released and compare them with existing ones.”