CareMedEval Dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Doria Bonzi; Alexandre Guiggi; Frederic Bechet; Carlos Ramisch; Benoit Favre

Abstract

Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) offer promising support for this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: both open and commercial models fail to exceed an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves the results. Models remain particularly challenged by questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
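
The abstract reports results as an Exact Match Rate without defining it here. A minimal sketch of one common reading follows, assuming multiple-choice questions where a prediction counts as correct only if the set of selected options is identical to the reference set. The function name `exact_match_rate` and the toy answers are hypothetical and are not taken from the paper or its released code.

```python
from typing import Iterable, Sequence


def exact_match_rate(
    predictions: Sequence[Iterable[str]],
    references: Sequence[Iterable[str]],
) -> float:
    """Fraction of questions whose predicted option set equals the reference set.

    Assumption: each answer is a collection of option labels (e.g. "A".."E"),
    and partially correct selections earn no credit.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    # Compare as sets so option order and duplicates do not matter.
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical toy data: four questions, two answered exactly right.
gold = [{"A", "C"}, {"B"}, {"A", "D", "E"}, {"C"}]
pred = [{"A", "C"}, {"B", "D"}, {"A", "D", "E"}, {"E"}]

print(exact_match_rate(pred, gold))  # 0.5
```

Under this reading, the second toy question scores zero even though option B was correctly selected, which is one reason an all-or-nothing metric can be demanding on questions with several correct options.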

Anthology ID: 2026.lrec-main.404
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 5169–5181
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.404/
Cite (ACL): Doria Bonzi, Alexandre Guiggi, Frederic Bechet, Carlos Ramisch, and Benoit Favre. 2026. CareMedEval Dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field. International Conference on Language Resources and Evaluation, main:5169–5181.
Cite (Informal): CareMedEval Dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field (Bonzi et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.404.pdf
