MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

Kangda Wei; Ruihong Huang

Abstract

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models. However, GRPO training is computationally intensive and slow, which creates barriers for academic researchers and smaller organizations with limited GPU budgets. In this paper, we propose MMR-GRPO, which accelerates GRPO training and reduces the overall training time required to reach peak performance. The approach adopts Maximal Marginal Relevance (MMR) to reweight the rewards of multiple rollouts, balancing rollout quality against diversity to reduce rollout redundancy. The rationale is that redundant or similar completions, if used repeatedly to train a model, create an “exploitation trap” and slow model convergence in GRPO-style reinforcement learning. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
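To make the idea of diversity-aware reward reweighting concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formula, so several pieces here are assumptions: the function names `mmr_reweight` and `group_advantages`, the use of cosine similarity over hypothetical per-rollout embeddings, the trade-off weight `lam`, and the scoring rule, which simply borrows the classic MMR form (relevance minus maximum similarity to items already selected). The released code should be consulted for the actual method.

```python
import numpy as np


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR-style reward reweighting (illustrative assumption only).

    rewards:    one scalar reward per rollout in the group.
    embeddings: one non-zero vector per rollout (hypothetical representation).
    lam:        trade-off between reward (quality) and redundancy (assumed).
    """
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)

    # Pairwise cosine similarity between rollouts.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T

    adjusted = np.zeros(len(rewards))
    selected = []
    remaining = list(range(len(rewards)))

    while remaining:
        scores = {}
        for i in remaining:
            # Redundancy: highest similarity to any rollout picked so far.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            scores[i] = lam * rewards[i] - (1.0 - lam) * redundancy
        # Pick the rollout with the best quality/diversity trade-off.
        best = max(remaining, key=lambda i: scores[i])
        adjusted[best] = scores[best]
        selected.append(best)
        remaining.remove(best)

    return adjusted


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: subtract the mean, divide by the std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Two identical correct rollouts and one distinct incorrect rollout.
rewards = [1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adjusted = mmr_reweight(rewards, embeddings)
advantages = group_advantages(adjusted)
```

In this toy group, the second correct rollout duplicates the first, so under the assumed rule its reweighted reward is lower than the first rollout's even though their raw rewards are equal. The reweighted rewards then pass through the usual group normalization to give advantages.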

Anthology ID: 2026.findings-acl.467
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9584–9605
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.467/
Cite (ACL): Kangda Wei and Ruihong Huang. 2026. MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting. In Findings of the Association for Computational Linguistics: ACL 2026, pages 9584–9605, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting (Wei & Huang, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.467.pdf
Checklist: 2026.findings-acl.467.checklist.pdf
