Towards Reward Fairness in RLHF: From a Resource Allocation Perspective

Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, Yong Liu


Abstract
Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect and exhibit various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method that addresses reward fairness from a resource allocation perspective: it is not designed for any specific type of bias, yet it effectively mitigates them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fair reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences in a fairer manner. Our data and code are available at https://github.com/shoyua/Towards-Reward-Fairness.
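To make the utility-fairness trade-off concrete, the sketch below shows one way a fairness term could be added to a standard Bradley-Terry reward-model objective. This is only an illustrative assumption: the Gini-style penalty, the weight `lam`, and the function names are hypothetical and are not taken from the paper's actual Fairness Regularization or Fairness Coefficient formulations.

```python
import torch
import torch.nn.functional as F

def gini_penalty(rewards: torch.Tensor) -> torch.Tensor:
    """Gini coefficient of a batch of rewards (shifted to be positive).

    One common fairness measure from the resource allocation literature,
    used here purely for illustration; the paper's regularizer may differ.
    """
    r = rewards - rewards.min() + 1e-8                      # shift so all values are positive
    diffs = torch.abs(r.unsqueeze(0) - r.unsqueeze(1))      # pairwise absolute differences
    return diffs.mean() / (2 * r.mean())                    # mean abs. difference / (2 * mean)

def fairness_regularized_loss(chosen_r: torch.Tensor,
                              rejected_r: torch.Tensor,
                              lam: float = 0.1) -> torch.Tensor:
    """Pairwise Bradley-Terry loss plus a fairness penalty on the rewards
    allocated within the batch (hypothetical formulation)."""
    bt_loss = -F.logsigmoid(chosen_r - rejected_r).mean()   # standard reward-model objective
    fairness = gini_penalty(torch.cat([chosen_r, rejected_r]))
    return bt_loss + lam * fairness

# Toy usage: scalar scores from a reward head for 4 preference pairs.
chosen = torch.randn(4)
rejected = torch.randn(4)
loss = fairness_regularized_loss(chosen, rejected, lam=0.1)
```

Under this assumed formulation, `lam` controls the utility-fairness trade-off: a larger value pushes the allocated rewards toward a more even distribution at some cost to preference accuracy.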
Anthology ID:
2025.acl-long.163
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
3247–3259
URL:
https://preview.aclanthology.org/landing_page/2025.acl-long.163/
Cite (ACL):
Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, and Yong Liu. 2025. Towards Reward Fairness in RLHF: From a Resource Allocation Perspective. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3247–3259, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective (Ouyang et al., ACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-long.163.pdf