Xiaolong Xu


2025

Huatuo-26M, a Large-scale Chinese Medical QA Dataset
Xidong Wang | Jianquan Li | Shunian Chen | Yuxuan Zhu | Xiangbo Wu | Zhiyi Zhang | Xiaolong Xu | Junying Chen | Jie Fu | Xiang Wan | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025

Large Language Models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck to community progress. In this paper, we release Huatuo-26M, the largest medical Question Answering (QA) dataset to date, containing 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in many aspects: (i) it serves as fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical models as a pre-training corpus. Our empirical findings substantiate the dataset’s utility in these domains, thereby confirming its significance as a resource in the medical QA landscape.
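To make use (ii) concrete, below is a minimal sketch of treating the QA pairs as an external knowledge source for retrieval-augmented generation. The file name, JSON-lines layout, and the "question"/"answer" field names are assumptions for illustration, not the paper's release format; the retrieval here is a simple TF-IDF similarity search rather than the retrievers benchmarked in the paper.

```python
# Sketch: retrieve related Huatuo-26M QA pairs and prepend them to a query
# before calling an LLM. Data path and field names are illustrative assumptions.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed format: one JSON object per line with "question" and "answer" fields.
with open("huatuo_26m_sample.jsonl", encoding="utf-8") as f:
    qa_pairs = [json.loads(line) for line in f]

questions = [qa["question"] for qa in qa_pairs]

# Character n-grams avoid the need for Chinese word segmentation.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
question_matrix = vectorizer.fit_transform(questions)

def retrieve(query: str, k: int = 3) -> list[dict]:
    """Return the k QA pairs whose questions are most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, question_matrix)[0]
    top_idx = scores.argsort()[::-1][:k]
    return [qa_pairs[i] for i in top_idx]

def build_prompt(query: str) -> str:
    """Assemble a RAG-style prompt: retrieved QA pairs followed by the user question."""
    context = "\n\n".join(
        f"Q: {qa['question']}\nA: {qa['answer']}" for qa in retrieve(query)
    )
    return f"Reference QA pairs:\n{context}\n\nUser question: {query}\nAnswer:"
```

In practice a dense retriever and a vector index would replace the TF-IDF step, but the overall pattern of retrieve-then-prompt is the same.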