BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Saket Reddy, Ke Yang, ChengXiang Zhai


Abstract
Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, social bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, an adaptation of Group Relative Policy Optimization (GRPO) that stabilizes alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online reinforcement learning. To adapt GRPO, we curate and synthetically extend a dataset spanning multiple domains and contexts, and create a custom, bias-specific reward model for effectively guiding generation while avoiding knowledge degradation. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness as an alignment technique that can overcome the limitations of previous preference-based methods.
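The group-relative baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization, in which each completion's reward is centered on the mean of its sampling group and scaled by the group's standard deviation; the paper's exact formulation may differ. The function name, the `eps` constant, and the reward values are hypothetical and not taken from the paper.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Assumption: advantage_i = (r_i - mean(group)) / (std(group) + eps).
    The group mean plays the role of the baseline that a learned critic
    (value function) would otherwise provide in PPO.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-model scores for four completions of one prompt.
group_rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(group_rewards)
```

Completions scoring above the group mean receive positive advantages and those below receive negative ones, so the update signal depends only on relative quality within the group rather than on the absolute scale of the reward model.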
Anthology ID:
2026.findings-acl.2052
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
41250–41267
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2052/
Cite (ACL):
Saket Reddy, Ke Yang, and ChengXiang Zhai. 2026. BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41250–41267, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization (Reddy et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2052.pdf
Checklist:
2026.findings-acl.2052.checklist.pdf