@inproceedings{zhang-etal-2025-cmedcalc,
title = "{CM}ed{C}alc-Bench: A Fine-Grained Benchmark for {C}hinese Medical Calculations in {LLM}",
author = "Zhang, Yunyan and
Zhu, Zhihong and
Wu, Xian",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1302/",
pages = "25661--25670",
ISBN = "979-8-89176-332-6",
abstract = "Large Language Models (LLMs) have demonstrated significant potential in medical diagnostics and clinical decision-making. While benchmarks such as MedQA and PubMedQA have advanced the evaluation of qualitative reasoning, existing medical NLP benchmarks still face two limitations: the absence of a Chinese benchmark for medical calculation tasks, and the lack of fine-grained evaluation of intermediate reasoning. In this paper, we introduce CMedCalc-Bench, a new benchmark designed for Chinese medical calculation. CMedCalc-Bench covers 69 calculators across 12 clinical departments, featuring over 1,000 real-world patient cases. Building on this, we design a fine-grained evaluation framework that disentangles clinical entity extraction from numerical computation, enabling systematic diagnosis of model deficiencies. Experiments across four model families, including medical-specialized and reasoning-focused, provide an assessment of their strengths and limitations on Chinese medical calculation. Furthermore, explorations on faithful reasoning and the demonstration effect offer early insights into advancing safe and reliable clinical computation."
}