ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?
Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, Gianluca Moro
Abstract
Medical multiple-choice question answering (MCQA) benchmarks show that models achieve near-human accuracy, with some benchmarks approaching saturation, leading to claims of clinical readiness. Yet a single accuracy score is a poor proxy for competence: models that change answers under minor input perturbations cannot be considered reliable. We argue that reliability underpins accuracy: only consistent predictions make correctness meaningful. We release ReMedQA, a new benchmark that augments three standard medical MCQA datasets with open-ended answers and systematically perturbed options. Building on this, we introduce two reliability metrics, ReAcc and ReCon: ReAcc measures the proportion of questions answered correctly across all variations, while ReCon measures the proportion answered consistently regardless of correctness. Our evaluation shows that high MCQA accuracy masks low reliability: models remain sensitive to format and perturbation changes, and domain specialization offers no robustness gain. MCQA underestimates smaller models while inflating larger ones that exploit structural cues, with some exceeding 50% accuracy even when the original questions are hidden. This shows that, despite near-saturated accuracy, we are not yet done with medical MCQA benchmarks.
- Anthology ID:
- 2026.eacl-long.124
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2706–2738
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.124/
- Cite (ACL):
- Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, and Gianluca Moro. 2026. ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2706–2738, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks? (Cocchieri et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.124.pdf
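The two reliability metrics described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed data layouts, not the authors' implementation: function names, the per-question answer lists, and the toy data below are all assumptions.

```python
# Minimal sketch of ReAcc and ReCon as described in the abstract.
# NOT the paper's implementation; names and data layout are assumed.
# `predictions` maps each question ID to the model's answers across all
# variations of that question (original, open-ended, perturbed options);
# `gold` maps each question ID to its reference answer.

def reacc(predictions, gold):
    """Fraction of questions answered correctly across ALL variations."""
    correct = sum(
        1 for qid, answers in predictions.items()
        if all(a == gold[qid] for a in answers)
    )
    return correct / len(predictions)

def recon(predictions):
    """Fraction of questions answered consistently, regardless of correctness."""
    consistent = sum(
        1 for answers in predictions.values()
        if len(set(answers)) == 1
    )
    return consistent / len(predictions)

# Toy example: 3 questions, 3 variations each.
preds = {
    "q1": ["B", "B", "B"],  # consistent and correct
    "q2": ["A", "C", "A"],  # inconsistent (correct only sometimes)
    "q3": ["D", "D", "D"],  # consistent but wrong
}
gold = {"q1": "B", "q2": "A", "q3": "B"}

print(round(reacc(preds, gold), 3))  # 0.333 -> only q1 is always correct
print(round(recon(preds), 3))        # 0.667 -> q1 and q3 never waver
```

The toy data makes the abstract's point concrete: ReCon can exceed ReAcc because a model may be perfectly consistent while being consistently wrong, which is exactly why the paper treats the two as complementary.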