Dingling Xu
2026
CheckRLM: Effective Knowledge–Thought Coherence Checking in Retrieval-Augmented Reasoning
Dingling Xu | Ruobing Wang | Qingfei Zhao | Yukun Yan | Zhichun Wang | Daren Zha | Shi Yu | Zhenghao Liu | Shuo Wang | Xu Han | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reasoning Language Models (RLMs) have significantly improved performance on complex tasks by extending the reasoning chain. However, these chains are prone to containing factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose **CheckRLM**, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by timely checking and correcting factual errors. Specifically, CheckRLM extracts factual claims from the reasoning chain to identify and localize subtle knowledge inconsistencies during inference. Upon detection of errors, a refinement mechanism performs minimal-cost yet precise corrections by leveraging external knowledge, ensuring coherence between the reasoning chain and correct knowledge. Extensive experiments demonstrate that CheckRLM substantially outperforms existing baselines, exhibiting a strong capability to mitigate error accumulation in long-horizon reasoning with lower costs. The code and data are available at https://github.com/AI9Stars/CheckRLM.
2025
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Kunlun Zhu | Yifan Luo | Dingling Xu | Yukun Yan | Zhenghao Liu | Shi Yu | Ruobing Wang | Shuo Wang | Yishan Li | Nan Zhang | Xu Han | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics—Completeness, Hallucination, and Irrelevance—to evaluate LLM-generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.