AIPO: Adaptive Information Guided Token-Level Reinforcement Learning for Large Language Model Reasoning

Bin Chen, Hongfei Ye, Huiyang Wang, Wenxi Liu, Yu Zhang, Furui Liu


Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning capability of Large Language Models (LLMs). Current RLVR methods train LLMs on all generated tokens rather than identifying which tokens actually contribute to reasoning. We propose AIPO (Adaptive-Information Policy Optimization), which focuses updates on the decisive tokens, discovered on the fly. AIPO scores each token by estimating the mutual information of its hidden state. Policy gradients are then computed only on these critical tokens, using an advantage that blends information gain and verifiable correctness. To make mutual-information estimation efficient, AIPO adopts a Random-Fourier approximation of the Hilbert-Schmidt Independence Criterion. Across five math and science benchmarks, AIPO yields up to +20% accuracy over strong RLVR baselines while updating only 10% of tokens. Our findings highlight the importance of information-driven token selection for efficient and effective reinforcement learning of LLM reasoning.
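As a rough illustration of the estimator family the abstract names, the sketch below computes a biased Hilbert-Schmidt Independence Criterion (HSIC) estimate using random Fourier features for a Gaussian kernel. It is not the authors' implementation: the function names, the bandwidth, the feature count, and the choice of inputs are all assumptions made for illustration, and the abstract does not say which variables the dependence is measured between. The point is only that the feature-space form costs time linear in the number of samples, instead of the quadratic cost of full kernel matrices.

```python
import numpy as np


def rff_features(x, num_features, bandwidth, rng):
    """Random Fourier features approximating a Gaussian (RBF) kernel.

    Hypothetical helper: x has shape (n, d); returns shape (n, num_features).
    """
    d = x.shape[1]
    # Frequencies drawn from the kernel's spectral density, phases uniform.
    w = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ w + b)


def rff_hsic(x, y, num_features=128, bandwidth=1.0, seed=0):
    """Biased HSIC estimate between paired samples x and y (assumed setup)."""
    rng = np.random.default_rng(seed)
    zx = rff_features(x, num_features, bandwidth, rng)
    zy = rff_features(y, num_features, bandwidth, rng)
    # Centering the features plays the role of the centering matrix H.
    zx -= zx.mean(axis=0)
    zy -= zy.mean(axis=0)
    # Squared Frobenius norm of the feature cross-covariance.
    cross_cov = zx.T @ zy / x.shape[0]
    return float(np.sum(cross_cov ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(500, 4))
    y_dependent = x + 0.1 * rng.normal(size=(500, 4))
    y_independent = rng.normal(size=(500, 4))
    print("dependent:  ", rff_hsic(x, y_dependent))
    print("independent:", rff_hsic(x, y_independent))
```

In this toy setup the dependent pair should score clearly higher than the independent pair. How such a score would be attached to individual tokens or hidden states is left open here, since the abstract does not give those details.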
Anthology ID:
2026.acl-long.2057
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
44441–44450
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2057/
Cite (ACL):
Bin Chen, Hongfei Ye, Huiyang Wang, Wenxi Liu, Yu Zhang, and Furui Liu. 2026. AIPO: Adaptive Information Guided Token-Level Reinforcement Learning for Large Language Model Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 44441–44450, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
AIPO: Adaptive Information Guided Token-Level Reinforcement Learning for Large Language Model Reasoning (Chen et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2057.pdf
Checklist:
 2026.acl-long.2057.checklist.pdf