Zhaoyi Sun
2025
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes
Asma Ben Abacha
|
Wen-wai Yim
|
Yujuan Fu
|
Zhaoyi Sun
|
Meliha Yetisgen
|
Fei Xia
|
Thomas Lin
Findings of the Association for Computational Linguistics: ACL 2025
Several studies have shown that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score in some medical exams. However, to our knowledge, no study has been conducted to assess the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (https://github.com/abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset has been used in the MEDIQA-CORR 2024 shared task to evaluate seventeen participating systems. In this paper, we describe the data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, and DeepSeek-R1) for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities. We also conducted a comparative study where two medical doctors performed the same task on the MEDEC test set. The results showed that MEDEC is a sufficiently challenging benchmark to assess the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs have a good performance in error detection and correction, they are still outperformed by medical doctors in these tasks. We discuss the potential factors behind this gap, the insights from our experiments, the limitations of current evaluation metrics, and share potential pointers for future research.
2024
Overview of the MEDIQA-M3G 2024 Shared Task on Multilingual Multimodal Medical Answer Generation
Wen-wai Yim
|
Asma Ben Abacha
|
Yujuan Fu
|
Zhaoyi Sun
|
Fei Xia
|
Meliha Yetisgen
|
Martin Krallinger
Proceedings of the 6th Clinical Natural Language Processing Workshop
Remote patient care provides opportunities for expanding medical access, saving healthcare costs, and offering on-demand convenient services. In the MEDIQA-M3G 2024 Shared Task, researchers explored solutions for the specific task of dermatological consumer health visual question answering, where user generated queries and images are used as input and a free-text answer response is generated as output. In this novel challenge, eight teams with a total of 48 submissions were evaluated across three language test sets. In this work, we provide a summary of the dataset, as well as results and approaches. We hope that the insights learned here will inspire future research directions that can lead to technology that deburdens clinical workload and improves care.
Overview of the MEDIQA-CORR 2024 Shared Task on Medical Error Detection and Correction
Asma Ben Abacha
|
Wen-wai Yim
|
Yujuan Fu
|
Zhaoyi Sun
|
Fei Xia
|
Meliha Yetisgen
Proceedings of the 6th Clinical Natural Language Processing Workshop
Automatic detection and correction of medical errors enables a more rigorous validation of medical documentation as well as clinical notes generated by large language models. Such solutions can ensure the accuracy and medical coherence of clinical texts and enhance patient care and health outcomes. The MEDIQA-CORR 2024 shared task focused on detecting and correcting different types of medical errors in clinical texts. Seventeen teams participated in the shared task and experimented with a broad range of approaches and models. In this paper, we describe the MEDIQA-CORR task, datasets, and the participants’ results and methods.
Search
Fix author
Co-authors
- Asma Ben Abacha 3
- Yujuan Fu 3
- Fei Xia 3
- Meliha Yetisgen-Yildiz 3
- Wen-wai Yim 3
- show all...