2025
Huatuo-26M, a Large-scale Chinese Medical QA Dataset
Xidong Wang | Jianquan Li | Shunian Chen | Yuxuan Zhu | Xiangbo Wu | Zhiyi Zhang | Xiaolong Xu | Junying Chen | Jie Fu | Xiang Wan | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck hindering community progress. In this paper, we release Huatuo-26M, the largest medical Question Answering (QA) dataset to date, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in many aspects: (i) it serves as fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical models as a pre-training corpus. Our empirical findings substantiate the dataset's utility in these respects, thereby confirming its significance as a resource in the medical QA landscape.
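As a rough illustration of the RAG use case mentioned in the abstract, below is a minimal sketch of using the dataset's QA pairs as an external knowledge source. It is not the paper's pipeline: the Hugging Face dataset identifier, split slice, and "question"/"answer" field names are assumptions and should be replaced with the actual release details; the TF-IDF retriever is a stand-in for whatever retriever the paper benchmarks.

```python
# Minimal sketch: using Huatuo-26M QA pairs as an external knowledge source for RAG.
# Assumptions (not from the paper): the dataset lives on the Hugging Face Hub under
# the hypothetical id "FreedomIntelligence/Huatuo-26M" with "question"/"answer" fields.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load a small slice of the QA pairs to keep the example lightweight.
qa = load_dataset("FreedomIntelligence/Huatuo-26M", split="train[:10000]")
answers = qa["answer"]

# Simple character n-gram TF-IDF index (reasonable for Chinese text without a
# word tokenizer); any sparse or dense retriever could be swapped in here.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
index = vectorizer.fit_transform(answers)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k stored answers most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    top = scores.argsort()[::-1][:k]
    return [answers[i] for i in top]

# Retrieved passages are prepended to the LLM prompt as external knowledge.
question = "高血压患者的日常饮食应注意什么？"  # "What should hypertension patients watch in their daily diet?"
context = "\n".join(retrieve(question))
prompt = f"参考资料：\n{context}\n\n问题：{question}\n回答："
print(prompt)
```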