Semi-Supervised Reward Modeling via Iterative Self-Training
Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao
Abstract
Reward models (RMs) capture the values and preferences of humans and play a central role in Reinforcement Learning from Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM iterates over three key steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volume. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
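The three-step loop described in the abstract can be sketched compactly. The following is a minimal illustration, not the paper's implementation: the reward-model interface (`rm.score`), the finetuning routine (`finetune`), the number of rounds, and the threshold value `tau` are all hypothetical stand-ins, and the confidence rule assumes a standard Bradley-Terry pairwise preference model.

```python
import torch

def pseudo_label_and_filter(rm, unlabeled_pairs, tau=0.9):
    """Steps 1-2: pseudo-label unlabeled response pairs with the current
    reward model and keep only high-confidence examples."""
    selected = []
    with torch.no_grad():
        for prompt, resp_a, resp_b in unlabeled_pairs:
            r_a = rm.score(prompt, resp_a)  # hypothetical API: scalar reward tensor
            r_b = rm.score(prompt, resp_b)
            # Bradley-Terry preference probability P(A preferred over B)
            p_a = torch.sigmoid(r_a - r_b).item()
            if p_a >= tau:
                selected.append((prompt, resp_a, resp_b))  # A pseudo-labeled as chosen
            elif p_a <= 1.0 - tau:
                selected.append((prompt, resp_b, resp_a))  # B pseudo-labeled as chosen
    return selected

def ssrm(rm, labeled_pairs, unlabeled_pairs, rounds=3, tau=0.9):
    """Iterative self-training: pseudo-label, filter by confidence, finetune."""
    for _ in range(rounds):
        pseudo = pseudo_label_and_filter(rm, unlabeled_pairs, tau)
        # Step 3: supervised finetuning on labeled plus confident pseudo-labeled
        # data with the usual pairwise RM loss; `finetune` is a placeholder.
        rm = finetune(rm, labeled_pairs + pseudo)
    return rm
```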
- Anthology ID: 2024.findings-emnlp.434
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 7365–7377
- URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.434/
- DOI: 10.18653/v1/2024.findings-emnlp.434
- Cite (ACL): Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, and Han Zhao. 2024. Semi-Supervised Reward Modeling via Iterative Self-Training. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7365–7377, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Semi-Supervised Reward Modeling via Iterative Self-Training (He et al., Findings 2024)
- PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.434.pdf