Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

Cheng Wang; Qin Liu; Wenxuan Zhou; Muhao Chen

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the trade-off between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
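
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of the idea under stated assumptions. It assumes that each token's contribution to the covariance is the product of its centered log-probability and its centered advantage, and that the Gaussian kernel is applied to this term after standardizing it with the batch mean and standard deviation, so that no bandwidth has to be tuned. The function names `gaussian_kernel_weights` and `reweight_advantages` are hypothetical and do not come from the paper.

```python
import math


def gaussian_kernel_weights(log_probs, advantages, eps=1e-8):
    """Return one weight in (0, 1] per token (illustrative, not the paper's code)."""
    n = len(log_probs)
    mean_lp = sum(log_probs) / n
    mean_adv = sum(advantages) / n

    # Assumed per-token contribution to Cov(log-prob, advantage).
    cov_terms = [
        (lp - mean_lp) * (adv - mean_adv)
        for lp, adv in zip(log_probs, advantages)
    ]

    mean_c = sum(cov_terms) / n
    std_c = math.sqrt(sum((c - mean_c) ** 2 for c in cov_terms) / n)

    # Gaussian kernel on the standardized term. Using the batch standard
    # deviation as the bandwidth is an assumption made for this sketch.
    return [
        math.exp(-0.5 * ((c - mean_c) / (std_c + eps)) ** 2)
        for c in cov_terms
    ]


def reweight_advantages(log_probs, advantages):
    """Scale each advantage by its kernel weight, shrinking extreme tokens."""
    weights = gaussian_kernel_weights(log_probs, advantages)
    return [w * adv for w, adv in zip(weights, advantages)]


# Toy batch: the last token has a very low log-probability and a large advantage.
log_probs = [-0.1, -0.2, -0.15, -5.0]
advantages = [0.5, -0.5, 0.2, 3.0]
weights = gaussian_kernel_weights(log_probs, advantages)
scaled = reweight_advantages(log_probs, advantages)
```

In this toy batch the last token receives the smallest weight, so its update is damped most strongly while the remaining tokens keep most of their advantage signal. Whether the paper centers the kernel on the batch mean, on zero, or on another statistic is not stated in the abstract.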

Anthology ID: 2026.acl-short.45
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 540–546
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-short.45/
Cite (ACL): Cheng Wang, Qin Liu, Wenxuan Zhou, and Muhao Chen. 2026. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 540–546, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting (Wang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-short.45.pdf
Checklist: 2026.acl-short.45.checklist.pdf
