Bin Wang
2026
Exploration-Exploitation Reshaping towards Efficient Reasoning for Large Language Models
Yufeng Shi | Weilin Luo | Yuxiang Zhang | Zongmeng Zhang | Haoyang Liu | Yubing Wang | Bin Wang | Wengang Zhou | Houqiang Li
Findings of the Association for Computational Linguistics: ACL 2026
Although Large Reasoning Models (LRMs) excel at solving complex problems, they are still constrained by overthinking. Most current studies rely on reward shaping in Reinforcement Learning (RL) to shorten the Chain-of-Thought (CoT) of LRMs, but these approaches remain sample-inefficient and non-robust because they lack guided exploration and prioritized exploitation. To address these issues, we propose a novel policy optimization framework with **S**elf-**I**mitation and self-**G**uidance **M**ech**A**nisms (SIGMA), which reshapes exploration and exploitation through two core components: (i) **self-imitation exploitation**, which enables prioritized exploitation of high-value prompts and rollouts by introducing a self-imitated loss and a dynamic sampling strategy based on compression rate; and (ii) **self-guidance exploration**, which provides preference-aware exploration guidance through diverse and pluggable self-rewriting strategies. Experiments across various datasets indicate that our method achieves superior reasoning efficiency without compromising overall accuracy, and can even improve it. Furthermore, ablation studies show that the proposed mechanisms provide flexible control interfaces for the tradeoff between the reasoning accuracy and efficiency of LRMs.
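The abstract does not define the compression rate or the sampling rule, so the following is only a minimal sketch of what prioritized sampling by compression rate could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the definition of compression rate as the fraction of tokens saved relative to a baseline length, the softmax weighting, the `temperature` parameter, and the function names `compression_rate`, `sampling_weights`, and `sample_prioritized`.

```python
import math
import random


def compression_rate(rollout_len: int, baseline_len: int) -> float:
    """Assumed definition: fraction of tokens saved relative to a baseline length."""
    if baseline_len <= 0:
        raise ValueError("baseline_len must be positive")
    return 1.0 - rollout_len / baseline_len


def sampling_weights(rates, temperature: float = 1.0):
    """Softmax over compression rates (hypothetical weighting scheme).

    A higher compression rate yields a higher sampling weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [r / temperature for r in rates]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_prioritized(items, rates, k: int, temperature: float = 1.0, seed=None):
    """Draw k items with replacement, favoring higher compression rates."""
    rng = random.Random(seed)
    weights = sampling_weights(rates, temperature)
    return rng.choices(items, weights=weights, k=k)
```

In this sketch, a lower `temperature` concentrates sampling on the most compressed rollouts, while a higher one approaches uniform sampling. That is one conceivable way to expose the kind of accuracy-efficiency control the abstract mentions, though the paper's actual mechanism may differ.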