Beyond Excess and Deficiency: Adaptive Length Bias Mitigation in Reward Models for RLHF

Yuyan Bu, Liangyu Huo, Yi Jing, Qing Yang


Abstract
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models (LLMs) with human values. However, it has been noted that reward models in RLHF often exhibit unintended biases, such as an overemphasis on response length based on the erroneous assumption that longer responses are universally preferred. This “length bias” can lead to excessively verbose responses that compromise the quality of LLM alignment. Previous efforts to mitigate length bias in reward models have inadvertently decreased their accuracy by neglecting the legitimate influence of response length on human preferences. In this work, we argue that response length is a context-specific factor in human evaluations, with different queries naturally eliciting varying preferences for response length. We propose an adaptive approach to modeling length preferences that dynamically adjusts the influence of response length in reward evaluations according to the context of the query. Experimental results demonstrate that our adaptive approach effectively balances the mitigation of undesired length hacking with alignment accuracy, reducing unnecessary verbosity while improving overall response quality.
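To make the core idea concrete, below is a minimal sketch (not the authors' published method; the architecture and all names are hypothetical) of a reward head that separates a length-agnostic content score from a length term whose weight is predicted from the query, so the same model can favor brevity for some queries and detail for others. It assumes pooled query and response embeddings from some backbone encoder and uses PyTorch.

import torch
import torch.nn as nn

class LengthAdaptiveRewardHead(nn.Module):
    """Hypothetical sketch: reward = content score + query-conditioned length term."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Length-agnostic quality score computed from the response representation.
        self.content_head = nn.Linear(hidden_size, 1)
        # Query-dependent weight deciding how much (and in which direction)
        # response length should contribute to the reward.
        self.length_gate = nn.Linear(hidden_size, 1)

    def forward(
        self,
        query_emb: torch.Tensor,      # (batch, hidden) pooled query embedding
        response_emb: torch.Tensor,   # (batch, hidden) pooled response embedding
        response_len: torch.Tensor,   # (batch,) response length in tokens
    ) -> torch.Tensor:
        content = self.content_head(response_emb).squeeze(-1)
        # tanh bounds the length weight in [-1, 1]: positive favors longer
        # responses, negative favors shorter ones, near zero ignores length.
        w_len = torch.tanh(self.length_gate(query_emb)).squeeze(-1)
        # Log-scaled length feature dampens the effect of very long responses.
        len_feat = torch.log1p(response_len.float())
        return content + w_len * len_feat

In training, such a head would be fit with the standard Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected), over preference pairs; the query-conditioned gate is what lets the length contribution vary per query rather than being removed or fixed globally.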
Anthology ID:
2025.findings-naacl.169
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3091–3098
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.169/
Cite (ACL):
Yuyan Bu, Liangyu Huo, Yi Jing, and Qing Yang. 2025. Beyond Excess and Deficiency: Adaptive Length Bias Mitigation in Reward Models for RLHF. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3091–3098, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Beyond Excess and Deficiency: Adaptive Length Bias Mitigation in Reward Models for RLHF (Bu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.169.pdf