Benchmarking LLMs on Authentic Cases from Medical Journals
Wanlong Liu, Junying Chen, Yunjin Yang, Prayag Tiwari, Wenyu Chen, Benyou Wang
Abstract
In recent years, large language models (LLMs) have demonstrated remarkable capabilities in the medical domain. However, existing medical benchmarks suffer from performance saturation and are predominantly derived from medical exam questions, which fail to reflect the complexity of real-world clinical scenarios. To bridge this gap, we introduce ClinBench, a challenging benchmark based on authentic clinical cases sourced from authoritative medical journals. Each question retains the complete patient information and clinical test results from the original case, effectively simulating real-world clinical practice. Additionally, we implement a rigorous human review process involving medical experts to ensure the quality and reliability of the benchmark. ClinBench supports both textual and multimodal evaluation formats, covering 11 medical specialties with over 2,000 questions, including a dedicated rare disease track, providing a comprehensive resource for assessing the medical reasoning capabilities of LLMs. We evaluate the performance of over 20 open-source and proprietary LLMs and benchmark them against human medical experts. Our findings reveal that human experts still retain an advantage within their specialized fields, while LLMs demonstrate superior overall performance on a broader range of medical specialties.
- Anthology ID:
- 2026.findings-acl.767
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15651–15675
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.767/
- Cite (ACL):
- Wanlong Liu, Junying Chen, Yunjin Yang, Prayag Tiwari, Wenyu Chen, and Benyou Wang. 2026. Benchmarking LLMs on Authentic Cases from Medical Journals. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15651–15675, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Benchmarking LLMs on Authentic Cases from Medical Journals (Liu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.767.pdf