Lan Lan


2025

pdf bib
Large-Scale Corpus Construction and Retrieval-Augmented Generation for Ancient Chinese Poetry: New Method and Data Insights
Yang Liu | Lan Lan | Jiahuan Cao | Hiuyi Cheng | Kai Ding | Lianwen Jin
Findings of the Association for Computational Linguistics: NAACL 2025

Ancient Chinese Poetry (ACP), a critical aspect of Chinese cultural heritage, presents unique challenges for Large Language Models (LLMs). One of the most pressing challenges is the significant hallucination issues faced by LLMs due to data scarcity and limited ability of general LLMs when dealing with ACP. To address these challenges, this paper constructs the ACP-Corpus, which encompasses 1.1 million ancient poems and 990K related texts, designed to enhance the training and performance of LLMs. Alongside this, we develop the ACP-QA dataset, comprising over 12 million question-answer pairs across 24 task categories, and the ACP-Eval dataset for rigorous evaluation purposes, containing 7,050 entries. Building on this resources, we propose the ACP-RAG framework, a specialized Retrieval-Augmented Generation (RAG) approach that significantly improves the performance of LLMs in the domain of ancient poetry from 49.2% to 89.0%. The ACP-RAG contains five modules of semantic coarse-grained retrieval, semantic fine-grained retrieval, keyword retrieval, keyword matching, and context filtering. Experiments show that ACP-RAG achieves a promising response accuracy of 89.0%, surpassing existing LLMs by a remarkable margin. We believe this work not only advances the capabilities of LLMs in processing ancient Chinese poetry but also contributes to the preservation and innovative development within this rich literary tradition. The datasets and code are available at https://github.com/SCUT-DLVCLab/ACP-RAG.