Senjie Jin
2024
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning
Lu Chen | Rui Zheng | Binghai Wang | Senjie Jin | Caishuang Huang | Junjie Ye | Zhihao Zhang | Yuhao Zhou | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reinforcement Learning from Human Feedback (RLHF) is a crucial approach to aligning language models with human values and intentions. A fundamental challenge in this method lies in ensuring that the reward model accurately understands and evaluates human preferences. Current methods rely on ranking losses to teach the reward model to assess preferences, but they are susceptible to noise and ambiguous data, often failing to deeply understand human intentions. To address this issue, we introduce contrastive learning into the reward modeling process. In addition to supervised ranking loss, we introduce an unsupervised contrastive loss to enable the reward model to fully capture the distinctions in contrastive data. Experimental results demonstrate that the proposed contrastive learning-based reward modeling method effectively enhances the generalization of the reward model, stabilizes the reinforcement learning training process, and improves the final alignment with human preferences.
2023
Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement
Zhiheng Xi | Senjie Jin | Yuhao Zhou | Rui Zheng | Songyang Gao | Jia Liu | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2023
To enhance the multi-step reasoning capabilities of large language models, researchers have extensively explored prompting methods, notably the Chain-of-Thought (CoT) method, which explicitly elicits human-like rationales. However, they have inadvertently overlooked the potential of enhancing model reasoning performance by formulating higher-quality problems. In this work, we start from the problem side and propose Self-Polish (SP), a novel method that facilitates the model’s reasoning by guiding it to progressively refine the given problems to be more comprehensible and solvable. We also explore several automatic prompting variants and propose the Self-Polish prompt bank for the community. SP is orthogonal to all other prompting methods on the answer/reasoning side, such as CoT, allowing for seamless integration with state-of-the-art techniques for further improvement. Thorough experiments show that the proposed method attains notable and consistent effectiveness on five reasoning benchmarks across different models. Furthermore, our method also showcases impressive performance on robustness evaluation. Code and prompts are available at https://github.com/WooooDyy/Self-Polish.