Rongchu Zhou
2025
A Comprehensive Literary Chinese Reading Comprehension Dataset with an Evidence Curation Based Solution
Dongning Rao
|
Rongchu Zhou
|
Peng Chen
|
Zhihua Jiang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Low-resource language understanding is challenging, even for large language models (LLMs). An epitome of this problem is the CompRehensive lIterary chineSe readIng comprehenSion (CRISIS), whose difficulties include limited linguistic data, long input, and insight-required questions. Besides the compelling necessity of providing a larger dataset for CRISIS, excessive information, order bias, and entangled conundrums still haunt the CRISIS solutions. Thus, we present the eVIdence cuRation with opTion shUffling and Abstract meaning representation-based cLauses segmenting (VIRTUAL) procedure for CRISIS, with the largest dataset. While the dataset is also named CRISIS, it results from a three-phase construction process, including question selection, data cleaning, and a silver-standard data augmentation step, which augments translations, celebrity profiles, government jobs, reign mottos, and dynasty to CRISIS. The six steps of VIRTUAL include embedding, shuffling, abstract beaning representation based option segmenting, evidence extracting, solving, and voting. Notably, the evidence extraction algorithm facilitates literary Chinese evidence sentences, translated evidence sentences, and annotations of keywords with a similarity-based ranking strategy. While CRISIS congregates understanding-required questions from seven sources, the experiments on CRISIS substantiate the effectiveness of VIRTUAL, with a 7 percent hike in accuracy compared with the baseline. Interestingly, both non-LLMs and LLMs have order bias, and abstract beaning representation based option segmenting is constructive for CRISIS.