Exploration-Exploitation Reshaping towards Efficient Reasoning for Large Language Models

Yufeng Shi, Weilin Luo, Yuxiang Zhang, Zongmeng Zhang, Haoyang Liu, Yubing Wang, Bin Wang, Wengang Zhou, Houqiang Li


Abstract
While excelling at solving complex problems, Large Reasoning Models (LRMs) are still constrained by the overthinking issue. Most current studies rely on reward shaping in Reinforcement Learning (RL) to shorten the Chain-of-Thought (CoT) of LRMs, but these approaches remain sample-inefficient and non-robust due to the absence of guided exploration and prioritized exploitation. To address these issues, we propose a novel policy optimization framework with **S**elf-**I**mitation and self-**G**uidance **M**ech**A**nisms (SIGMA), which reshapes exploration and exploitation through two core components: (i) **self-imitation exploitation**, which enables the prioritized exploitation of high-value prompts and rollouts by introducing a self-imitated loss and a dynamic sampling strategy based on compression rate; (ii) **self-guidance exploration**, which provides preference-aware exploration guidance through diverse and pluggable self-rewriting strategies. Experiments across various datasets indicate that our method achieves superior reasoning efficiency while preserving, and even improving, overall accuracy. Furthermore, ablation studies show that the proposed mechanisms provide flexible control interfaces for the tradeoff between the reasoning accuracy and efficiency of LRMs.
Anthology ID:
2026.findings-acl.1520
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
30392–30407
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1520/
Cite (ACL):
Yufeng Shi, Weilin Luo, Yuxiang Zhang, Zongmeng Zhang, Haoyang Liu, Yubing Wang, Bin Wang, Wengang Zhou, and Houqiang Li. 2026. Exploration-Exploitation Reshaping towards Efficient Reasoning for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30392–30407, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Exploration-Exploitation Reshaping towards Efficient Reasoning for Large Language Models (Shi et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1520.pdf
Checklist:
 2026.findings-acl.1520.checklist.pdf