EMPEC: A Comprehensive Benchmark for Evaluating Large Language Models Across Diverse Healthcare Professions

Zheheng Luo, Chenhan Yuan, Qianqian Xie, Sophia Ananiadou


Abstract
Recent advancements in Large Language Models (LLMs) show their potential to accurately answer biomedical questions, yet current healthcare benchmarks primarily assess knowledge mastered by medical doctors, neglecting other essential professions. To address this gap, we introduce the Examinations for Medical PErsonnel in Chinese (EMPEC), a comprehensive healthcare knowledge benchmark featuring 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented roles such as Optometrists and Audiologists. Each question is tagged with its release time and source authenticity. We evaluated 17 LLMs, including proprietary and open-source models, finding that while models like GPT-4 achieved over 75% accuracy, they struggled with specialized fields and alternative medicine. Notably, we find that most medical-specific LLMs underperform their general-purpose counterparts on EMPEC, and that incorporating EMPEC's data in fine-tuning improves performance. In addition, we tested LLMs on questions released after the completion of their training to examine their ability to handle unseen queries. We also translated the test set into English and Simplified Chinese and analysed the impact on different models. Our findings emphasize the need for broader benchmarks to assess LLM applicability in real-world healthcare, and we will provide the dataset and evaluation toolkit for future research.
Anthology ID:
2025.findings-acl.518
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9945–9958
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.518/
Cite (ACL):
Zheheng Luo, Chenhan Yuan, Qianqian Xie, and Sophia Ananiadou. 2025. EMPEC: A Comprehensive Benchmark for Evaluating Large Language Models Across Diverse Healthcare Professions. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9945–9958, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
EMPEC: A Comprehensive Benchmark for Evaluating Large Language Models Across Diverse Healthcare Professions (Luo et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.518.pdf