Binghai Wang
2026
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Binghai Wang | Yantao Liu | Yuxuan Liu | Tianyi Tang | Shenzhi Wang | Chang Gao | Chujie Zheng | Yichang Zhang | Le Yu | Shixuan Liu | Tao Gui | Qi Zhang | Xuanjing Huang | Bowen Yu | Fei Huang | Junyang Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Binghai Wang | Yantao Liu | Yuxuan Liu | Tianyi Tang | Shenzhi Wang | Chang Gao | Chujie Zheng | Yichang Zhang | Le Yu | Shixuan Liu | Tao Gui | Qi Zhang | Xuanjing Huang | Bowen Yu | Fei Huang | Junyang Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. Using RM during RLHF, our method effectively improves performance as demonstrated on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
2024
Reward Modeling Requires Automatic Adjustment Based on Data Quality
Binghai Wang | Rui Zheng | Lu Chen | Zhiheng Xi | Wei Shen | Yuhao Zhou | Dong Yan | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
Binghai Wang | Rui Zheng | Lu Chen | Zhiheng Xi | Wei Shen | Yuhao Zhou | Dong Yan | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
In Reinforcement Learning from Human Feedback (RLHF), the reward model plays a crucial role in aligning language model outputs with human values. The human preference data used to train the reward model consists of a prompt and a response pair, with humans annotating which response better aligns with human value preferences. Due to the complexity and subjectivity of the annotation task, multiple organizations including OpenAI and Anthropic report significant noise in the human preference datasets, leading to instability and deviation in reward model training from human values. We discover that the difference in scores assigned to response pairs by the reward model effectively indicates the quality of data, and data of varying qualities show significant distinctions in reward model training. We introduce a method that automatically adjusts reward modeling based on data quality, reducing the impact of noise and making full use of dataset. Experiments on multiple human preference datasets demonstrate that our method stabilizes reward model training and significantly enhances the alignment performance of RLHF.
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but they are susceptible to noise and ambiguous data, often failing to deeply understand human intentions. To address this issue, we introduce contrastive learning into the reward modeling process. In addition to supervised ranking loss, we introduce an unsupervised contrastive loss to enable the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.