Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

Benteng Chen; Weida Wang; Shufei Zhang; Mingbao Lin; Min Zhang

Abstract

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose **Step-GRPO**, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
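The abstract names two ingredients without giving formulas: counting reasoning in semantic steps delimited by linguistic markers, and a reward that penalizes redundancy relative to a group-level baseline. The sketch below is one possible reading of that idea, not the paper's actual method. The marker list, the `alpha` weight, the choice of the mean step count of correct rollouts as the baseline, and the function names `count_steps` and `step_aware_rewards` are all assumptions made for illustration.

```python
import re
from statistics import mean

# Hypothetical marker list; the abstract does not enumerate the markers used.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm")
_MARKER_RE = re.compile(r"\b(?:" + "|".join(STEP_MARKERS) + r")\b")


def count_steps(reasoning: str) -> int:
    """Count semantic steps: one initial step plus one per marker occurrence."""
    return 1 + len(_MARKER_RE.findall(reasoning))


def step_aware_rewards(rollouts, correct, alpha=0.5):
    """Assign a reward to each rollout in one sampled group.

    Assumed rule (not taken from the paper): incorrect rollouts get 0.0;
    correct rollouts get 1.0 minus alpha times their relative excess of
    steps over the mean step count of the correct rollouts in the group.
    """
    steps = [count_steps(r) for r in rollouts]
    correct_steps = [s for s, ok in zip(steps, correct) if ok]
    if not correct_steps:
        # No correct rollout means there is no baseline to compare against.
        return [0.0] * len(rollouts)
    baseline = mean(correct_steps)
    rewards = []
    for s, ok in zip(steps, correct):
        if not ok:
            rewards.append(0.0)
            continue
        excess = max(0.0, (s - baseline) / baseline)
        rewards.append(1.0 - alpha * excess)
    return rewards


group = [
    "Compute 2+2 = 4.",
    "Compute 2+2 = 4. Wait, check again. Hmm, still 4.",
    "Guess 5.",
]
print(step_aware_rewards(group, [True, True, False]))
```

Under these assumptions, a correct rollout at or below the group baseline keeps the full reward, while a correct rollout with extra re-checking steps is discounted in proportion to its excess. Because the baseline comes from the group itself, the penalty adapts to problem difficulty instead of imposing a fixed length budget.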

Anthology ID: 2026.acl-long.990
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 21710–21724
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.990/
Cite (ACL): Benteng Chen, Weida Wang, Shufei Zhang, Mingbao Lin, and Min Zhang. 2026. Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21710–21724, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning (Chen et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.990.pdf
Checklist: 2026.acl-long.990.checklist.pdf
