Reinforcement Learning for Large Language Models via Group Preference Reward Shaping
Huaisheng Zhu, Siyuan Xu, Hangfan Zhang, Teng Xiao, Zhimeng Guo, Shijie Zhou, Shuyue Hu, Vasant G. Honavar
Abstract
Large Language Models (LLMs) require alignment via reinforcement learning (RL) to effectively meet task-specific objectives, such as human preference alignment and enhanced reasoning. While Proximal Policy Optimization (PPO) is widely adopted, its computational overhead, which stems from the need for an additional value model, limits its applicability. Existing alternatives, such as Group Relative Policy Optimization (GRPO), mitigate computational costs but remain sensitive to reward model quality. To address this, we introduce Group Preference Reward Shaping (GPRS), a novel method that leverages preference-based comparisons rather than precise numerical rewards. GPRS requires no extra model components and remains robust across varying reward model sizes and qualities. Extensive experiments demonstrate that GPRS consistently outperforms existing critic-model-free RL algorithms in Reinforcement Learning from Human Feedback (RLHF) and reasoning tasks, providing stable and strong alignment performance.
- Anthology ID: 2025.emnlp-main.1085
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 21398–21411
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1085/
- Cite (ACL): Huaisheng Zhu, Siyuan Xu, Hangfan Zhang, Teng Xiao, Zhimeng Guo, Shijie Zhou, Shuyue Hu, and Vasant G. Honavar. 2025. Reinforcement Learning for Large Language Models via Group Preference Reward Shaping. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21398–21411, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Reinforcement Learning for Large Language Models via Group Preference Reward Shaping (Zhu et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1085.pdf
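The abstract above sketches the core GPRS idea: shape rewards from preference-based comparisons within a group of sampled responses rather than from raw reward-model scores. Below is a minimal Python sketch of that idea, assuming access to a pairwise preference model; it illustrates group-preference reward shaping under those assumptions and is not the paper's exact GPRS formulation. All names here (e.g., `preference_prob`, `group_preference_advantages`) are invented for the example.

```python
from typing import Callable, List


def group_preference_advantages(
    responses: List[str],
    preference_prob: Callable[[str, str], float],
) -> List[float]:
    """Shape rewards via pairwise preferences within a sampled group.

    preference_prob(a, b) returns the probability that response `a` is
    preferred over `b` (e.g., from a Bradley-Terry-style preference model).
    Each response's shaped reward is its mean win-rate against the rest of
    the group; centering the win-rates yields zero-mean, group-relative
    advantages without a critic model, in the spirit of GRPO-style updates.
    """
    g = len(responses)
    win_rates = [
        sum(preference_prob(responses[i], responses[j])
            for j in range(g) if j != i) / (g - 1)
        for i in range(g)
    ]
    mean = sum(win_rates) / g
    return [w - mean for w in win_rates]


if __name__ == "__main__":
    # Toy preference "model" for illustration only: prefer longer responses.
    def toy_pref(a: str, b: str) -> float:
        if len(a) == len(b):
            return 0.5
        return 1.0 if len(a) > len(b) else 0.0

    group = ["short", "a medium answer", "a much longer, detailed answer"]
    print(group_preference_advantages(group, toy_pref))
```

Because the shaped advantages depend only on within-group comparisons, the scale of the underlying reward or preference model drops out, which is one plausible reading of the abstract's claim of robustness to reward model size and quality.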