Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via sequence-level likelihood

Xingyu Lin; Yilin Wen; Du Su; En Wang; Wenbin Liu; Zhonghou Lv; Jinchang Hou; Chenfu Bao

Abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with sparse token-level rewards, an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
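The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names, written under stated assumptions rather than taken from the paper. It assumes the standard GRPO group-relative advantage (each reward standardized within its sampled group), assumes that a sequence's advantage is shared by all of its tokens, and assumes that "decreasing entropy" means the current policy's token entropy is lower than the old policy's. The function names `group_relative_advantages`, `kl_mask`, and `masked_kl_penalty` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_mask(advantage, old_entropy, new_entropy):
    """Mark tokens with positive advantage and decreasing entropy.

    Assumption: the sequence-level advantage applies to every token, and
    entropy is compared between the old and the current policy per token.
    """
    return [
        1 if advantage > 0 and h_new < h_old else 0
        for h_old, h_new in zip(old_entropy, new_entropy)
    ]


def masked_kl_penalty(token_kl, mask):
    """Average per-token KL estimates over the masked tokens only.

    Assumption: a simple mean; the paper's actual aggregation may differ.
    """
    selected = [kl for kl, m in zip(token_kl, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


# Four sampled responses to one prompt, with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Per-token entropies for the first response under the old and current policy.
mask = kl_mask(advantages[0], [2.0, 1.0, 0.5], [1.5, 1.2, 0.4])
penalty = masked_kl_penalty([0.2, 0.9, 0.4], mask)
```

In this sketch, only the first and third tokens of the first response are constrained, because their entropy dropped while the response's advantage was positive. Tokens of responses with a negative advantage receive no KL penalty at all.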

Anthology ID: 2026.acl-long.1488
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 32256–32269
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1488/
Cite (ACL): Xingyu Lin, Yilin Wen, Du Su, En Wang, Wenbin Liu, Zhonghou Lv, Jinchang Hou, and Chenfu Bao. 2026. Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via sequence-level likelihood. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32256–32269, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via sequence-level likelihood (Lin et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1488.pdf
Checklist: 2026.acl-long.1488.checklist.pdf