MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, Thomas Lin


Abstract
Several studies have shown that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score on some medical exams. However, to our knowledge, no study has been conducted to assess the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (https://github.com/abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset was used in the MEDIQA-CORR 2024 shared task to evaluate seventeen participating systems. In this paper, we describe the data creation methods and evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, and DeepSeek-R1) on the tasks of detecting and correcting medical errors, which require both medical knowledge and reasoning capabilities. We also conducted a comparative study in which two medical doctors performed the same tasks on the MEDEC test set. The results showed that MEDEC is a sufficiently challenging benchmark for assessing the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs perform well at error detection and correction, they are still outperformed by medical doctors on these tasks. We discuss the potential factors behind this gap, the insights from our experiments, the limitations of current evaluation metrics, and pointers for future research.
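To make the task concrete, the following is a minimal sketch of how one might probe an LLM on a MEDEC-style error detection and correction example. It is illustrative only: the clinical note, prompt wording, and model name below are hypothetical assumptions, not the paper's actual data, prompts, or evaluated configurations (see the repository for the real format).

```python
# Sketch of the MEDEC-style task: given a clinical note, a model must
# (1) flag whether the note contains a medical error, (2) locate the
# erroneous sentence, and (3) propose a corrected sentence.
# NOTE: the note text, prompt, and model name are illustrative
# assumptions, not the benchmark's actual prompts or data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

note = (
    "0: A 45-year-old man presents with chest pain radiating to the left arm. "
    "1: ECG shows ST-segment elevation in leads II, III, and aVF. "
    "2: He is diagnosed with a pulmonary embolism and taken for catheterization."
)

prompt = (
    "The following clinical note may contain one medical error "
    "(Diagnosis, Management, Treatment, Pharmacotherapy, or Causal Organism). "
    "If it does, reply with the sentence number and a corrected version of "
    "that sentence; otherwise reply CORRECT.\n\n" + note
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the paper evaluates several recent LLMs
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```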
Anthology ID: 2025.findings-acl.1159
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 22539–22550
URL: https://preview.aclanthology.org/corrections-2025-08/2025.findings-acl.1159/
DOI: 10.18653/v1/2025.findings-acl.1159
Cite (ACL): Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. 2025. MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22539–22550, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes (Ben Abacha et al., Findings 2025)
PDF: https://preview.aclanthology.org/corrections-2025-08/2025.findings-acl.1159.pdf