@inproceedings{lee-lim-2024-towards,
title = "Towards {P}areto-Efficient {RLHF}: Paying Attention to a Few High-Reward Samples with Reward Dropout",
author = "Lee, Changhun and
Lim, Chiehyeon",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.findings-emnlp.489/",
doi = "10.18653/v1/2024.findings-emnlp.489",
pages = "8335--8349",
abstract = "Recently, leveraging reinforcement learning (RL) to fine-tune language models (LMs), known as reinforcement learning from human feedback (RLHF), has become an important research topic. However, there is still a lack of theoretical understanding of how RLHF works, the conditions under which it succeeds or fails, and whether it guarantees optimization of both likelihood $\beta(\cdot)$ and reward $R(\cdot)$ objectives. To address these issues, we consider RLHF as a bi-objective problem that has the nature of a \textit{Pareto} optimization, present a Pareto improvement condition that is necessary to obtain Pareto-efficient policies, and propose a simple yet powerful method named \textit{reward dropout} that guarantees a Pareto improvement. To demonstrate the performance of reward dropout, two benchmark datasets commonly used in text style transfer tasks were utilized in our study: sentiment and topic datasets sourced from Yelp and AG{\_}News, respectively. Our experiments highlight that paying attention to a few samples with higher rewards leads to greater Pareto improvements regardless of model size. We also demonstrate that the effect of reward dropout is generalizable and most effective with non-pretrained target models, saving the effort of pretraining."
}
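For readers skimming the entry, the abstract's core idea, reward dropout, can be sketched as: keep only the few samples whose rewards fall above a high threshold and let only those contribute to the policy update. The snippet below is a minimal illustrative sketch under that reading, not the authors' implementation; the quantile threshold, function names, and the REINFORCE-style loss are assumptions.

```python
import torch

def reward_dropout(rewards: torch.Tensor, keep_quantile: float = 0.9) -> torch.Tensor:
    """Zero out rewards below a high quantile so only a few high-reward
    samples contribute to the update. `keep_quantile` is an illustrative
    hyperparameter, not a value taken from the paper."""
    threshold = torch.quantile(rewards, keep_quantile)
    return torch.where(rewards >= threshold, rewards, torch.zeros_like(rewards))

def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # Hypothetical REINFORCE-style RLHF objective:
    # log_probs -- per-sample log-likelihood of generations under the policy
    # rewards   -- per-sample scalar rewards from the reward model
    kept = reward_dropout(rewards)              # attend only to top-reward samples
    return -(kept.detach() * log_probs).mean()  # maximize reward-weighted likelihood
```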