Benchmarking LLMs on Authentic Cases from Medical Journals

Wanlong Liu; Junying Chen; Yunjin Yang; Prayag Tiwari; Wenyu Chen; Benyou Wang

Abstract

In recent years, large language models (LLMs) have demonstrated remarkable capabilities in the medical domain. However, existing medical benchmarks suffer from performance saturation and are predominantly derived from medical exam questions, which fail to reflect the complexity of real-world clinical scenarios. To bridge this gap, we introduce ClinBench, a challenging benchmark based on authentic clinical cases sourced from authoritative medical journals. Each question retains the complete patient information and clinical test results from the original case, effectively simulating real-world clinical practice. Additionally, we implement a rigorous human review process involving medical experts to ensure the quality and reliability of the benchmark. ClinBench supports both textual and multimodal evaluation formats, covering 11 medical specialties with over 2,000 questions, including a dedicated rare disease track, providing a comprehensive resource for assessing the medical reasoning capabilities of LLMs. We evaluate the performance of over 20 open-source and proprietary LLMs and benchmark them against human medical experts. Our findings reveal that human experts still retain an advantage within their specialized fields, while LLMs demonstrate superior overall performance across a broader range of medical specialties.

Anthology ID: 2026.findings-acl.767
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15651–15675
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.767/
Cite (ACL): Wanlong Liu, Junying Chen, Yunjin Yang, Prayag Tiwari, Wenyu Chen, and Benyou Wang. 2026. Benchmarking LLMs on Authentic Cases from Medical Journals. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15651–15675, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Benchmarking LLMs on Authentic Cases from Medical Journals (Liu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.767.pdf
Checklist: 2026.findings-acl.767.checklist.pdf
