Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models

Jie Liu; Wenxuan Wang; Su Yihang; Jingyuan Huang; Yudi Zhang; Cheng-Yi Li; Wenting Chen; Xiaohan Xing; Kao-Jung Chang; Linlin Shen; Michael R. Lyu

Abstract

Significant breakthroughs in Medical Multi-Modal Large Language Models (Med-MLLMs) are reshaping modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are ill-suited to them, because real-world diagnostic frameworks are intricate: they span diverse medical specialties and involve complex clinical decisions. A clinically representative benchmark is therefore highly desirable for credible Med-MLLM evaluation. To this end, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs in terms of distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying clinical tasks into 3 main categories and 8 sub-categories, and avoiding overlap with existing VQA datasets. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, offering insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs’ capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.

Anthology ID: 2025.acl-long.1178
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24181–24201
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1178/
Cite (ACL): Jie Liu, Wenxuan Wang, Su Yihang, Jingyuan Huang, Yudi Zhang, Cheng-Yi Li, Wenting Chen, Xiaohan Xing, Kao-Jung Chang, Linlin Shen, and Michael R. Lyu. 2025. Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24181–24201, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models (Liu et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1178.pdf