@inproceedings{kim-yoon-2025-questioning,
title = "Questioning Our Questions: How Well Do Medical {QA} Benchmarks Evaluate Clinical Capabilities of Language Models?",
author = "Kim, Siun and
Yoon, Hyung-Jin",
editor = "Demner-Fushman, Dina and
Ananiadou, Sophia and
Miwa, Makoto and
Tsujii, Junichi",
booktitle = "ACL 2025",
month = aug,
year = "2025",
address = "Viena, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bionlp-1.24/",
pages = "274--296",
ISBN = "979-8-89176-275-6",
abstract = "Recent advances in large language models (LLMs) have led to impressive performance on medical question-answering (QA) benchmarks. However, the extent to which these benchmarks reflect real-world clinical capabilities remains uncertain. To address this gap, we systematically analyzed the correlation between LLM performance on major medical QA benchmarks (e.g., MedQA, MedMCQA, PubMedQA, and MMLU medicine subjects) and clinical performance in real-world settings. Our dataset included 702 clinical evaluations of 85 LLMs from 168 studies. Benchmark scores demonsrated a moderate correlation with clinical performance (Spearman{'}s rho = 0.59), albeit substantially lower than inter-benchmark correlations. Among them, MedQA was the most predictive but failed to capture essential competencies such as patient communication, longitudinal care, and clinical information extraction. Using Bayesian hierarchical modeling, we estimated representative clinical performance and identified GPT-4 and GPT-4o as consistently top-performing models, often matching or exceeding human physicians. Despite longstanding concerns about the clinical validity of medical QA benchmarks, this study offers the first quantitative analysis of their alignment with real-world clinical performance."
}
Markdown (Informal)
[Questioning Our Questions: How Well Do Medical QA Benchmarks Evaluate Clinical Capabilities of Language Models?](https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bionlp-1.24/) (Kim & Yoon, BioNLP 2025)