Qianyun Du


2025

Should I Believe in What Medical AI Says? A Chinese Benchmark for Medication Based on Knowledge and Reasoning
Yue Wu | Yangmin Huang | Qianyun Du | Lixian Lai | Zhiyang He | Jiaxue Hu | Xiaodong Tao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large language models (LLMs) show potential in healthcare but often generate hallucinations, especially when handling unfamiliar information. In medication, a systematic benchmark to evaluate model capabilities is lacking, which is critical given the high-risk nature of medical information. This paper introduces a Chinese benchmark aimed at assessing models in medication tasks, focusing on knowledge and reasoning across six datasets: indication, dosage and administration, contraindicated population, mechanisms of action, drug recommendation, and drug interaction. We evaluate eight closed-source and five open-source models to identify knowledge boundaries, providing the first systematic analysis of limitations and risks in proprietary medical models.