Dingling Xu
2026
R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning
Qingfei Zhao | Ruobing Wang | Dingling Xu | Daren Zha | Ma Bowen | Zhichun Wang | Shijie Jia | Limin Liu | Xin Wang
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning–search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning–Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning–search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to search or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-type rewards to jointly optimize the reasoning–search trajectory. Experiments on seven datasets show that R-Search significantly outperforms mainstream RAG baselines.
CheckRLM: Effective Knowledge–Thought Coherence Checking in Retrieval-Augmented Reasoning
Dingling Xu | Ruobing Wang | Qingfei Zhao | Yukun Yan | Zhichun Wang | Daren Zha | Shi Yu | Zhenghao Liu | Shuo Wang | Xu Han | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reasoning Language Models (RLMs) have significantly improved performance on complex tasks by extending the reasoning chain. However, these chains are prone to factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose CheckRLM, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by checking and correcting factual errors in a timely manner. Specifically, CheckRLM extracts factual claims from the reasoning chain to identify and localize subtle knowledge inconsistencies during inference. When errors are detected, a refinement mechanism performs minimal-cost yet precise corrections by leveraging external knowledge, ensuring coherence between the reasoning chain and correct knowledge. Extensive experiments demonstrate that CheckRLM substantially outperforms existing baselines, exhibiting a strong capability to mitigate error accumulation in long-horizon reasoning at lower cost. The code and data are available at https://github.com/AI9Stars/CheckRLM.
2025
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Kunlun Zhu | Yifan Luo | Dingling Xu | Yukun Yan | Zhenghao Liu | Shi Yu | Ruobing Wang | Shuo Wang | Yishan Li | Nan Zhang | Xu Han | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics—Completeness, Hallucination, and Irrelevance—to evaluate LLM-generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.