MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai


Abstract
Deploying Large Language Models (LLMs) in medical applications requires rigorous fact-checking to ensure patient safety and regulatory compliance. We introduce **MedFact**, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show that models can often determine whether text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals an "over-criticism" phenomenon, where models misidentify correct information as erroneous, a tendency that is aggravated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.
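To make the two evaluation tasks and the "over-criticism" phenomenon concrete, the following is a minimal sketch of how such scores could be computed. It is not the authors' evaluation code. The `Prediction` record, its field names, the exact-match localization criterion, and the definition of the over-criticism rate as the share of correct texts flagged as erroneous are all assumptions made for illustration; the paper may define these metrics differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    """Hypothetical record for one benchmark instance (field names are assumed)."""
    gold_has_error: bool              # expert label: does the text contain an error?
    pred_has_error: bool              # model verdict for veracity classification
    gold_span: Optional[str] = None   # expert-marked erroneous span, if any
    pred_span: Optional[str] = None   # span the model claims is erroneous, if any


def veracity_accuracy(preds: List[Prediction]) -> float:
    """Fraction of instances where the model's error/no-error verdict matches the label."""
    if not preds:
        return 0.0
    return sum(p.pred_has_error == p.gold_has_error for p in preds) / len(preds)


def localization_accuracy(preds: List[Prediction]) -> float:
    """Among erroneous instances, fraction where the model flags an error and
    its span exactly matches the gold span (exact match is an assumed criterion)."""
    erroneous = [p for p in preds if p.gold_has_error]
    if not erroneous:
        return 0.0
    hits = sum(p.pred_has_error and p.pred_span == p.gold_span for p in erroneous)
    return hits / len(erroneous)


def over_criticism_rate(preds: List[Prediction]) -> float:
    """Among correct instances, fraction the model wrongly flags as erroneous."""
    correct = [p for p in preds if not p.gold_has_error]
    if not correct:
        return 0.0
    return sum(p.pred_has_error for p in correct) / len(correct)


# Toy data, invented purely to exercise the functions above.
toy = [
    Prediction(True, True, "span A", "span A"),   # detected and localized
    Prediction(True, True, "span B", "span C"),   # detected, wrong location
    Prediction(False, True, None, "span D"),      # over-criticism
    Prediction(False, False),                     # correctly accepted
]
scores = (veracity_accuracy(toy), localization_accuracy(toy), over_criticism_rate(toy))
```

On this toy data the veracity score exceeds the localization score, which mirrors the gap the abstract describes between detecting that an error exists and pinpointing it.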
Anthology ID:
2026.gem-main.59
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
604–652
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.59/
Cite (ACL):
Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, and Lixian Lai. 2026. MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 604–652, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts (He et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.59.pdf