2025
RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts
Mingyan Wu | Zhenghao Liu | Yukun Yan | Xinze Li | Shi Yu | Zheni Zeng | Yu Gu | Ge Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still struggle to use the knowledge in retrieved documents effectively and are often misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals into CoT-based summarization, refining knowledge based on the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidates based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating the CoT-style summary. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT provides shorter yet effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
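The training-data construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_training_example`, the `generate_cot` callback, and the rule that a candidate counts as "best" when it mentions the gold answer are all assumptions made for illustration; the abstract does not specify the ranking criterion.

```python
from typing import Callable, Dict, List, Optional


def build_training_example(
    query: str,
    documents: List[str],
    gold_answer: str,
    generate_cot: Callable[[str, str], str],  # hypothetical LLM call: (query, one document) -> CoT text
) -> Optional[Dict[str, object]]:
    """Build one fine-tuning example: all documents in, the best single-document CoT out."""
    # One CoT candidate per (query, individual document) pair.
    candidates = [generate_cot(query, doc) for doc in documents]

    # Assumed ranking signal: a candidate is "good" if it mentions the gold answer.
    answer = gold_answer.lower()
    good = [c for c in candidates if answer in c.lower()]
    bad = [c for c in candidates if answer not in c.lower()]
    if not good:
        return None  # no usable target for this query

    # The training prompt contains ALL retrieved documents, so reproducing the
    # target requires ignoring the irrelevant ones.
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    prompt = f"Question: {query}\n{context}"
    return {"prompt": prompt, "target": good[0], "rejected": bad}
```

The `rejected` list is kept only so the same record could feed a preference-based objective; whether the method uses one is not stated in the abstract.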
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Shuliang Liu | Xinze Li | Zhenghao Liu | Yukun Yan | Cheng Yang | Zheni Zeng | Zhiyuan Liu | Maosong Sun | Ge Yu
Findings of the Association for Computational Linguistics: ACL 2025
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models have the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, uses judge-consistency to evaluate these judgments, and selects the chosen and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various models and datasets. Further analysis reveals that judgments generated by ConsJudge agree closely with those of a superior LLM. All code is available at https://github.com/OpenBMB/ConsJudge.
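The selection step in this abstract (score judgments by their mutual consistency, then pick a chosen and a rejected one for DPO) might look like the sketch below. The names `jaccard` and `select_preference_pair` are hypothetical, and word-level Jaccard overlap is only a stand-in similarity measure chosen to keep the example dependency-free; the abstract does not say how consistency is computed.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap; an assumed stand-in for a real consistency measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def select_preference_pair(judgments: List[str]) -> Tuple[str, str]:
    """Return (chosen, rejected): the most and least self-consistent judgments."""
    if len(judgments) < 2:
        raise ValueError("need at least two judgments to measure consistency")

    # Score each judgment by its average agreement with all the others.
    scores = []
    for i, judgment in enumerate(judgments):
        others = [jaccard(judgment, o) for k, o in enumerate(judgments) if k != i]
        scores.append(sum(others) / len(others))

    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return judgments[best], judgments[worst]
```

In this sketch each string in `judgments` would come from prompting the same judge with a different combination of judgment dimensions; the returned pair then serves as one preference example.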
PersLLM: A Personified Training Approach for Large Language Models
Zheni Zeng | Jiayi Chen | Huimin Chen | Yukun Yan | Yuxuan Chen | Zhenghao Liu | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) exhibit human-like intelligence, enabling them to simulate human behavior and support various applications that require both humanized communication and extensive knowledge reserves. Existing efforts personify LLMs with special training data or hand-crafted prompts, but they face challenges such as insufficient data usage and rigid behavior patterns. Consequently, personified LLMs fail to capture personified knowledge or express persistent opinions. To fully unlock the potential of LLM personification, we propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction, improving the quality of data construction and capturing a personality's experiences, knowledge, and thoughts more comprehensively. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models’ personalities, which leads to more natural communication of opinions. Both automated metrics and expert human evaluations demonstrate the effectiveness of our approach. Case studies in human-machine interactions and multi-agent systems further suggest potential application scenarios and future directions for LLM personification.
KBAlign: Efficient Self Adaptation on Specific Textual Knowledge Bases
Zheni Zeng | Yuxuan Chen | Shi Yu | Ruobing Wang | Yukun Yan | Zhenghao Liu | Shuo Wang | Xu Han | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: EMNLP 2025
Although retrieval-augmented generation (RAG) remains essential for knowledge-based question answering (KBQA), current paradigms face critical challenges in specific domains. Existing methods struggle with targeted adaptation to small-scale KBs: vanilla unsupervised training exhibits poor effectiveness, while fine-tuning incurs prohibitive costs for external supervision signals. We present KBAlign, a self-supervised framework that enhances RAG systems through efficient model adaptation. Our key insight is to leverage the model’s intrinsic capabilities for knowledge alignment through two mechanisms: multi-grained self-annotation that captures global knowledge for data construction, and iterative tuning that accelerates convergence through self-verification. This framework enables cost-effective model adaptation to specific textual KBs, without human supervision or external model assistance. Experiments demonstrate that KBAlign can achieve 90% of the performance gain obtained through GPT-4-supervised adaptation, while relying entirely on self-annotation by much smaller models. KBAlign significantly improves downstream QA accuracy across multiple domains at minimal cost, particularly benefiting scenarios requiring deep knowledge integration from specialized corpora. We release our experimental data, models, and process analyses to the community for further exploration (https://anonymous.4open.science/r/KBAlign-D160).