Liangxin Liu
2026
ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
Yu Liang | Liangxin Liu | Longzheng Wang | Wangyan | Zhang Yueyang | Long Xia | Zhiyuan Sun | Daiting Shi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocate fine-grained, differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs. Our implementation is available at https://github.com/yuliangCarmelo/ConsistRM.
ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework
Kai Qin | Liangxin Liu | Yu Liang | Longzheng Wang | Wangyan | Zhang Yueyang | Long Xia | Zhiyuan Sun | Houde Liu | Daiting Shi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision and neglect the quality of the analytical process, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding a +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator. Our code is available at https://github.com/yuliangCarmelo/ReflectRM.
2024
Curriculum Consistency Learning for Conditional Sentence Generation
Liangxin Liu | Xuebo Liu | Lian Lian | Shengjun Cheng | Jun Rao | Tengfei Yu | Hexuan Deng | Min Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Consistency learning (CL) has proven to be a valuable technique for improving the robustness of models in conditional sentence generation (CSG) tasks by ensuring stable predictions across various input data forms. However, models augmented with CL often face challenges in optimizing consistency features, which can detract from their efficiency and effectiveness. To address these challenges, we introduce Curriculum Consistency Learning (CCL), a novel strategy that guides models to learn consistency in alignment with their current capacity to differentiate between features. CCL is designed around the inherent aspects of CL-related losses, promoting task independence and simplifying implementation. Implemented across four representative CSG tasks, including instruction tuning (IT) for large language models and machine translation (MT) in three modalities (text, speech, and vision), CCL demonstrates marked improvements. Specifically, it delivers an average improvement of +2.0 accuracy points over vanilla IT and an average increase of +0.7 in COMET scores over traditional CL methods in MT tasks. Our comprehensive analysis further indicates that models utilizing CCL are particularly adept at managing complex instances, showcasing the effectiveness and efficiency of CCL in improving CSG models. Code and scripts are available at https://github.com/xinxinxing/Curriculum-Consistency-Learning.