Hangfan Zhang
2025
Reinforcement Learning for Large Language Models via Group Preference Reward Shaping
Huaisheng Zhu
|
Siyuan Xu
|
Hangfan Zhang
|
Teng Xiao
|
Zhimeng Guo
|
Shijie Zhou
|
Shuyue Hu
|
Vasant G. Honavar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) require alignment via reinforcement learning (RL) to effectively perform task-specific objectives, such as human preference alignment and enhanced reasoning. While Proximal Policy Optimization (PPO) is widely adopted, its computational overhead, stemming from additional value model requirements, limits applicability. Existing alternatives, like Group Relative Policy Optimization (GRPO), mitigate computational costs but remain sensitive to reward model quality. To address this, we introduce Group Preference Reward Shaping (GPRS), a novel method that leverages preference-based comparisons rather than precise numerical rewards. GPRS requires no extra model components and remains robust across varying reward model sizes and qualities. Extensive experiments demonstrate that GPRS consistently outperforms existing critic-model-free RL algorithms in Reinforcement Learning from Human Feedback (RLHF) and reasoning tasks, providing stable and good alignment performance.
2024
Jailbreak Open-Sourced Large Language Models via Enforced Decoding
Hangfan Zhang
|
Zhimeng Guo
|
Huaisheng Zhu
|
Bochuan Cao
|
Lu Lin
|
Jinyuan Jia
|
Jinghui Chen
|
Dinghao Wu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have achieved unprecedented performance in Natural Language Generation (NLG) tasks. However, many existing studies have shown that they could be misused to generate undesired content. In response, before releasing LLMs for public access, model developers usually align those language models through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF). Consequently, those aligned large language models refuse to generate undesired content when facing potentially harmful/unethical requests. A natural question is “could alignment really prevent those open-sourced large language models from being misused to generate undesired content?”. In this work, we provide a negative answer to this question. In particular, we show those open-sourced, aligned large language models could be easily misguided to generate undesired content without heavy computations or careful prompt designs. Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide it to generate undesired content including harmful or biased information and even private data. We evaluate our method on 4 open-sourced LLMs accessible publicly and our finding highlights the need for more advanced mitigation strategies for open-sourced LLMs.
Search
Fix author
Co-authors
- Zhimeng Guo 2
- Huaisheng Zhu 2
- Bochuan Cao 1
- Jinghui Chen 1
- Vasant G. Honavar 1
- show all...