Prior Constraints-based Reward Model Training for Aligning Large Language Models

Hang Zhou, Chenglong Wang, Yimin Hu, Tong Xiao, Chunliang Zhang, Jingbo Zhu


Abstract
Reinforcement learning with human feedback for aligning large language models (LLMs) typically trains a reward model using a ranking loss over comparison pairs. However, this training procedure suffers from an inherent problem: the uncontrolled scaling of reward scores during reinforcement learning, due to the lack of constraints while training the reward model. This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate this problem. PCRM incorporates prior constraints, specifically the length ratio and cosine similarity between the outputs of each comparison pair, during reward model training to regulate the optimization magnitude and control score margins. We comprehensively evaluate PCRM by examining its rank correlation with human preferences and its effectiveness in aligning LLMs via RL. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling. As another bonus, our method is easily integrated into arbitrary rank-based alignment methods, such as direct preference optimization, and can yield consistent improvement. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/tree/PCRM.
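To illustrate the idea described in the abstract, the snippet below is a minimal, hypothetical sketch of a pairwise ranking loss whose score margin is modulated by prior constraints (length ratio and cosine similarity between the two outputs). The function name, the way the two priors are combined into a target margin, and the hyperparameter `alpha` are all assumptions for illustration; the exact PCRM formulation is given in the paper and the linked repository.

```python
import torch
import torch.nn.functional as F

def pcrm_ranking_loss(r_chosen, r_rejected,
                      len_chosen, len_rejected,
                      emb_chosen, emb_rejected,
                      alpha=1.0):
    """Hypothetical sketch of a prior-constrained pairwise ranking loss.

    r_chosen, r_rejected: reward scores for each comparison pair, shape (batch,)
    len_chosen, len_rejected: token lengths of the two outputs, shape (batch,)
    emb_chosen, emb_rejected: sentence embeddings of the two outputs, shape (batch, dim)
    """
    # Prior 1: length ratio of the two outputs, mapped into (0, 1].
    length_ratio = (torch.minimum(len_chosen, len_rejected).float()
                    / torch.maximum(len_chosen, len_rejected).float())

    # Prior 2: cosine similarity between the two outputs' embeddings.
    cos_sim = F.cosine_similarity(emb_chosen, emb_rejected, dim=-1)

    # Combine the priors into a target margin: very similar outputs
    # (high cosine similarity, balanced lengths) get a small margin,
    # dissimilar outputs get a larger one (illustrative choice only).
    target_margin = alpha * (1.0 - 0.5 * (cos_sim + length_ratio))

    # Standard Bradley-Terry-style ranking loss with the constraint-derived
    # margin, which bounds how far apart the two scores are pushed.
    return -F.logsigmoid(r_chosen - r_rejected - target_margin).mean()
```

Compared with a plain ranking loss, the margin term keeps the reward gap of near-equivalent outputs small, which is one way such priors can constrain reward score scaling before RL training begins.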
Anthology ID:
2024.ccl-1.107
Volume:
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Month:
July
Year:
2024
Address:
Taiyuan, China
Editors:
Sun Maosong, Liang Jiye, Han Xianpei, Liu Zhiyuan, He Yulan
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1395–1407
Language:
English
URL:
https://preview.aclanthology.org/name-variant-aaron-steven-white/2024.ccl-1.107/
Cite (ACL):
Hang Zhou, Chenglong Wang, Yimin Hu, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. 2024. Prior Constraints-based Reward Model Training for Aligning Large Language Models. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference), pages 1395–1407, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal):
Prior Constraints-based Reward Model Training for Aligning Large Language Models (Zhou et al., CCL 2024)
PDF:
https://preview.aclanthology.org/name-variant-aaron-steven-white/2024.ccl-1.107.pdf