Efficiently Learning To Reason or Not to Reason: Root-token Policy Optimization for Adaptive Thinking

Taehyeon Kim, Hyunsoo Lee, Youngsoo Jang, Moontae Lee


Abstract
Large reasoning models (LRMs) achieve strong performance by externalizing explicit reasoning traces before producing the answer, yet they suffer from overthinking: they allocate uniformly heavy computation to queries of varying difficulty. While proprietary models mitigate this via opaque routing, open-source LRMs still lack an efficient mechanism to internalize adaptive reasoning, owing to both high training cost and limited disclosure of training recipes. In response, we introduce RPO (Root-token Policy Optimization), a framework that enables LRMs to self-determine when to reason by training only the initial root token (e.g., whether to invoke the think tag) via group-relative rewards and group-wise advantages. By focusing on this pivotal branching point, RPO drastically reduces training overhead and VRAM usage. Across multiple model families and scales, RPO learns difficulty-aware adaptive thinking at just 2% of the training compute of prior adaptive-reasoning methods.
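As a rough illustration of the idea of optimizing only a root decision with group-wise advantages, the sketch below applies a generic group-normalized (GRPO-style) advantage to a single think/no-think logit. It is not the paper's objective: the abstract does not give the reward or loss, so the reward values, the single-logit policy, and the names `group_advantages` and `root_policy_step` are all assumptions made for illustration.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same query."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def root_policy_step(logit, actions, rewards, lr=0.1):
    """One policy-gradient ascent step on a single logit for P(think).

    actions: 1 if the sampled root token opened a reasoning trace, else 0.
    rewards: hypothetical per-sample scalar rewards (assumed, not from the paper).
    """
    p_think = sigmoid(logit)
    advantages = group_advantages(rewards)
    # Gradient of log pi(action) w.r.t. the logit of a Bernoulli policy:
    # (1 - p) when action == 1, and -p when action == 0.
    grad = sum(
        adv * ((1.0 - p_think) if act == 1 else -p_think)
        for act, adv in zip(actions, advantages)
    ) / len(actions)
    return logit + lr * grad


# Hypothetical group of four samples for one hard query:
# the two samples that chose to think were rewarded, the other two were not.
actions = [1, 1, 0, 0]
rewards = [1.0, 1.0, 0.0, 0.0]
new_logit = root_policy_step(0.0, actions, rewards)
```

In this toy setting only one scalar is updated per step, which loosely mirrors why restricting training to the root decision can be cheap; a real model would instead update the parameters that produce the root-token logits.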
Anthology ID:
2026.acl-long.816
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
17934–17949
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.816/
Cite (ACL):
Taehyeon Kim, Hyunsoo Lee, Youngsoo Jang, and Moontae Lee. 2026. Efficiently Learning To Reason or Not to Reason: Root-token Policy Optimization for Adaptive Thinking. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17934–17949, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Efficiently Learning To Reason or Not to Reason: Root-token Policy Optimization for Adaptive Thinking (Kim et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.816.pdf
Checklist:
 2026.acl-long.816.checklist.pdf