Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Huimin Xu, Shuai Zhao, Xiaobao Wu, Anh Tuan Luu


Abstract
Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strictly on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.
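
To make the notion of an entropy-increasing versus entropy-decreasing token update concrete, the following is a minimal illustrative sketch, not the paper's OPEFO implementation. It assumes a toy softmax policy over four tokens and a single plain policy-gradient step on one sampled token; the function names, the learning rate, and the example logits are all hypothetical choices made for illustration. It only shows that the sign of the entropy change depends on which token is reinforced.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_change(logits, token, advantage, lr=0.1):
    """Entropy after minus entropy before one policy-gradient step
    on a single sampled token (toy setting, hypothetical helper)."""
    probs = softmax(logits)
    # The gradient of log pi(token) w.r.t. the logits is onehot(token) - probs.
    new_logits = [
        z + lr * advantage * ((1.0 if i == token else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]
    return entropy(softmax(new_logits)) - entropy(probs)

# Assumed toy logits: token 0 is already the most likely, token 3 the least.
logits = [2.0, 1.0, 0.0, -1.0]

# Reinforcing the already-likely token sharpens the distribution.
delta_likely = entropy_change(logits, token=0, advantage=1.0)
# Reinforcing the unlikely token flattens the distribution.
delta_rare = entropy_change(logits, token=3, advantage=1.0)

print("likely token, positive advantage:", "decreases" if delta_likely < 0 else "increases", "entropy")
print("rare token, positive advantage:", "decreases" if delta_rare < 0 else "increases", "entropy")
```

In this toy setting, a positively rewarded high-probability token lowers entropy while a positively rewarded low-probability token raises it. If sampled tokens are mostly high-probability ones, updates of the first kind would be expected to dominate, which is one way to picture the imbalance the abstract describes.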
Anthology ID:
2026.findings-acl.879
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17759–17771
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.879/
Cite (ACL):
Huimin Xu, Shuai Zhao, Xiaobao Wu, and Anh Tuan Luu. 2026. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17759–17771, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization (Xu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.879.pdf
Checklist:
 2026.findings-acl.879.checklist.pdf