Danlong Yuan
2026
Shorten After You’re Right: Lazy Length Penalties for Reasoning RL
Danlong Yuan | Tian Xie | Shaohan Huang | Huishuai Zhang | Zhuocheng Gong | Chong Luo | Furu Wei | Dongyan Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Long-reasoning models achieve strong accuracy on complex reasoning tasks, but their extended reasoning trajectories incur substantial memory and latency costs. Several existing shortening methods rely on additional supervision or multi-stage post-training, which primarily reduces inference length and does not reduce the rollout tokens during on-policy reinforcement learning (RL). We instead target on-policy response shortening, aiming to improve both inference efficiency and RL training throughput. However, because on-policy RL couples optimization with exploration, naively penalizing length can destabilize training and suppress exploration. To impose length pressure safely, we propose a lazy length penalty integrated into the rule-based RL pipeline: it activates only on correct trajectories, only after training accuracy enters a stably improving regime, and only when responses exceed a tolerance band beyond the minimal correct length. Across four settings, our method significantly reduces response length without extra training stages while maintaining or improving performance. In a logic reasoning setting, we achieve a 40% reduction in step-averaged response length alongside a 14-point gain in performance. For math problems, we reduce step-averaged response length by 33% while preserving performance.
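The three activation conditions described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lazy_length_penalty`, the parameters `tolerance`, `weight`, and `window`, the test used for "stably improving" accuracy, and the linear, capped penalty shape are all assumptions made for illustration only.

```python
from typing import List


def lazy_length_penalty(
    lengths: List[int],
    correct: List[bool],
    accuracy_history: List[float],
    tolerance: float = 0.2,   # hypothetical width of the tolerance band
    weight: float = 0.5,      # hypothetical maximum penalty
    window: int = 3,          # hypothetical number of recent steps to inspect
) -> List[float]:
    """Return one non-negative penalty per rollout in a group for one prompt."""
    penalties = [0.0] * len(lengths)

    # Gate 1 (assumed test): accuracy must be "stably improving", taken here
    # to mean non-decreasing over the last `window` steps with a net gain.
    if len(accuracy_history) < window + 1:
        return penalties
    recent = accuracy_history[-(window + 1):]
    non_decreasing = all(b >= a for a, b in zip(recent, recent[1:]))
    if not (non_decreasing and recent[-1] > recent[0]):
        return penalties

    # Gate 2: the reference length is the shortest correct response.
    correct_lengths = [n for n, ok in zip(lengths, correct) if ok]
    if not correct_lengths:
        return penalties
    threshold = min(correct_lengths) * (1.0 + tolerance)

    # Gate 3: penalize only correct responses that exceed the tolerance band.
    for i, (n, ok) in enumerate(zip(lengths, correct)):
        if ok and n > threshold:
            excess = (n - threshold) / threshold
            penalties[i] = weight * min(1.0, excess)
    return penalties


group_penalties = lazy_length_penalty(
    lengths=[100, 110, 200, 300],
    correct=[True, True, True, False],
    accuracy_history=[0.2, 0.3, 0.4, 0.5],
)
```

In this example only the third response is penalized: it is correct and longer than 120 tokens (the shortest correct length, 100, plus the assumed 20% band). The incorrect 300-token response is left untouched, so the penalty never adds pressure on trajectories that have not yet reached a correct answer.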
2025
ReMamba: Equip Mamba with Effective Long-Sequence Modeling
Danlong Yuan | Jiahao Liu | Bei Li | Huishuai Zhang | Jingang Wang | Xunliang Cai | Dongyan Zhao
Findings of the Association for Computational Linguistics: EMNLP 2025
While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context efficiency issues of the Mamba models and propose ReMamba, which enhances Mamba’s ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba’s efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.