2025
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Xueru Wen | Jie Lou | Zichao Li | Yaojie Lu | XingYu XingYu | Yuqiu Ji | Guohai Xu | Hongyu Lin | Ben He | Xianpei Han | Le Sun | Debing Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
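As context for how benchmarks like CheemsBench are typically used, here is a minimal, hypothetical sketch (not the authors' evaluation code) of scoring a discriminative reward model by pairwise accuracy: the RM assigns a scalar to each response, and accuracy is the fraction of human-labeled pairs where the chosen response outscores the rejected one. The names `pairwise_accuracy`, `reward_fn`, and `toy_rm` are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): pairwise-accuracy evaluation of a
# discriminative reward model on (prompt, chosen, rejected) preference triples.
from typing import Callable, List, Tuple

def pairwise_accuracy(
    reward_fn: Callable[[str, str], float],     # hypothetical (prompt, response) -> scalar score
    pairs: List[Tuple[str, str, str]],          # (prompt, chosen, rejected) triples
) -> float:
    correct = 0
    for prompt, chosen, rejected in pairs:
        # The RM "agrees" with the human label if the chosen response scores higher.
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected):
            correct += 1
    return correct / len(pairs)

if __name__ == "__main__":
    # Toy reward function: longer responses score higher (placeholder only).
    toy_rm = lambda prompt, response: float(len(response))
    data = [("问题一", "详细且正确的回答", "短答"), ("问题二", "好的回答", "差")]
    print(f"pairwise accuracy: {pairwise_accuracy(toy_rm, data):.2f}")
```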
Critic-CoT: Boosting the Reasoning Abilities of Large Language Model via Chain-of-Thought Critic
Xin Zheng | Jie Lou | Boxi Cao | Xueru Wen | Yuqiu Ji | Hongyu Lin | Yaojie Lu | Xianpei Han | Debing Zhang | Le Sun
Findings of the Association for Computational Linguistics: ACL 2025
Self-critique has become a crucial mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly rely on basic prompts that elicit intuitive, instance-level feedback, which resembles System-1 processing and limits reasoning capabilities. Moreover, there is a lack of in-depth investigation into the relationship between an LLM’s ability to criticize and its task-solving performance. To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability. Through a step-wise CoT reasoning paradigm and the automatic construction of weak-supervision data without human annotation, Critic-CoT enables LLMs to engage in slow, analytic self-critique and refinement, thereby improving their reasoning abilities. Experiments on GSM8K and MATH, as well as out-of-domain evaluations, demonstrate that our enhanced model significantly boosts task-solving performance by filtering out invalid solutions or through iterative refinement. Furthermore, we investigate the intrinsic correlation between critique and task-solving abilities within LLMs, finding that these abilities can mutually reinforce each other rather than conflict.
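An illustrative sketch, not the released Critic-CoT implementation, of a step-wise critique-and-refine loop in the spirit of the abstract: a solution is kept only if the critic finds no step-level error; otherwise it is refined for a bounded number of rounds and finally filtered out as invalid. The callables `generate`, `critique`, and `refine` are hypothetical stand-ins for LLM calls.

```python
# Sketch of a critique-and-refine loop (hypothetical interface, not the paper's code).
from typing import Callable, Optional

def critic_cot_solve(
    problem: str,
    generate: Callable[[str], str],           # produce a step-wise CoT solution
    critique: Callable[[str, str], bool],     # step-wise check: True if no error is found
    refine: Callable[[str, str], str],        # rewrite the solution given its critique
    max_rounds: int = 3,
) -> Optional[str]:
    solution = generate(problem)
    for _ in range(max_rounds):
        if critique(problem, solution):       # accepted: keep the solution
            return solution
        solution = refine(problem, solution)  # otherwise refine and re-check
    return None                               # filtered out as invalid after max_rounds
```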
On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation
Xueru Wen | Jie Lou | Xinyu Lu | Yuqiu Ji | Xinyan Guan | Yaojie Lu | Hongyu Lin | Ben He | Xianpei Han | Debing Zhang | Le Sun
Findings of the Association for Computational Linguistics: ACL 2025
Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts, and their truthfulness and informativeness are assessed against external knowledge sources. The resulting statement-level fine-grained feedback is then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks validate RLFH’s effectiveness in hallucination mitigation.
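A minimal sketch, under assumptions the abstract does not spell out, of how statement-level verdicts could be mapped to token-level dense rewards: each atomic fact's score is credited to the last token of its span in the response, leaving all other tokens at zero. The names `dense_rewards`, `fact_spans`, and `verdicts` are hypothetical, not RLFH's actual interface.

```python
# Sketch: spreading statement-level verdicts onto token positions as dense rewards.
from typing import List, Tuple

def dense_rewards(
    tokens: List[str],
    fact_spans: List[Tuple[int, int]],   # (start, end) token indices of each atomic fact
    verdicts: List[float],               # e.g. +1.0 truthful, -1.0 hallucinated, per fact
) -> List[float]:
    rewards = [0.0] * len(tokens)
    for (start, end), score in zip(fact_spans, verdicts):
        rewards[end] += score            # credit assigned at the fact span's final token
    return rewards

if __name__ == "__main__":
    toks = ["Paris", "is", "the", "capital", "of", "Germany", "."]
    # The single atomic fact (tokens 0..5) is judged hallucinated, so the -1.0
    # reward lands on the span's last token ("Germany").
    print(dense_rewards(toks, fact_spans=[(0, 5)], verdicts=[-1.0]))
```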
2023
Learning In-context Learning for Named Entity Recognition
Jiawei Chen | Yaojie Lu | Hongyu Lin | Jie Lou | Wei Jia | Dai Dai | Hua Wu | Boxi Cao | Xianpei Han | Le Sun
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Named entity recognition in real-world applications suffers from the diversity of entity types, the emergence of new entity types, and the lack of high-quality annotations. To address the above problems, this paper proposes an in-context learning-based NER approach, which can effectively inject in-context NER ability into PLMs and recognize entities of novel types on-the-fly using only a few demonstrative instances. Specifically, we model PLMs as a meta-function λ_{(instruction, demonstrations, text)}.M, and a new entity extractor can be implicitly constructed by applying new instruction and demonstrations to PLMs, i.e., (λ.M)(instruction, demonstrations) -> F, where F: text -> entities is a new entity extractor. To inject the above in-context NER ability into PLMs, we propose a meta-function pre-training algorithm, which pre-trains PLMs by comparing the (instruction, demonstration)-initialized extractor with a surrogate golden extractor. Experimental results on 4 few-shot NER datasets show that our method can effectively inject in-context NER ability into PLMs and significantly outperforms the PLMs+fine-tuning counterparts.
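A minimal Python sketch, not taken from the paper, of the meta-function view described above: the PLM behaves like a lambda abstraction that, once applied to an instruction and demonstrations, returns a new extractor F: text -> entities. All names here (`meta_function`, `plm_extract`, `toy_plm`) are hypothetical illustrations.

```python
# Sketch of the meta-function reading of in-context NER: partially applying
# (instruction, demonstrations) to a PLM call yields a new extractor function.
from typing import Callable, List

Extractor = Callable[[str], List[str]]          # F: text -> entities

def meta_function(plm_extract: Callable[[str, List[str], str], List[str]]):
    """Wrap a PLM call so that applying (instruction, demonstrations) returns an extractor."""
    def apply(instruction: str, demonstrations: List[str]) -> Extractor:
        # F := lambda text. PLM(instruction, demonstrations, text)
        return lambda text: plm_extract(instruction, demonstrations, text)
    return apply

if __name__ == "__main__":
    # Toy PLM stand-in: "extracts" capitalized tokens that appear in the demonstrations.
    def toy_plm(instruction: str, demos: List[str], text: str) -> List[str]:
        vocab = {w for d in demos for w in d.split()}
        return [w for w in text.split() if w[0].isupper() and w in vocab]

    new_extractor = meta_function(toy_plm)("Extract person names.", ["Alice met Bob"])
    print(new_extractor("Yesterday Alice visited Paris"))   # -> ['Alice']
```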