Tingqiang Xu
2026
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Guanhua Huang | Tingqiang Xu | Mingze Wang | Qi Yi | Xue Gong | Siheng Li | Ruibin Xiong | Kejiao Li | Yuhao Jiang | Bo Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. While previous methods attempt to maintain high entropy, we argue that unselective entropy maximization risks amplifying irrelevant noise rather than fostering meaningful exploration. In this paper, we identify a deeper issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks, driven by RLVR over-penalization. To address this, we introduce Low-probability Regularization (Lp-Reg). Leveraging the statistical distinction where reasoning sparks exhibit higher probabilities than noise, Lp-Reg filters out the extremely low-probability noise tokens and prevents the suppression of potentially valuable low-probability candidates. Experiments demonstrate that Lp-Reg enables stable on-policy training for over 3,000 steps (81,204 GPU-hours), sustaining exploration in regimes where baselines typically collapse. Validated across extensive evaluations totaling over 300,000 cumulative GPU-hours, Lp-Reg demonstrates highly competitive performance in off-policy settings and consistently achieves state-of-the-art results in on-policy training across diverse model families, sizes, and domains, with relative accuracy improvements ranging from 3.06% to 7.98%.
Reinforcement Learning on Pre-Training Data
Siheng Li | Kejiao Li | Zenan Xu | Guanhua Huang | Kun Li | Haoyuan Wu | Wujiajia | Zihao Zheng | Chenchen Zhang | Kun Shi | Xue Gong | Qi Yi | Ruibin Xiong | Tingqiang Xu | Yuhao Jiang | Jianfeng Yan | Yuyuan Zeng | Guanghui Xu | Jinbao Xue | Zhijiang xu | Zheng Fang | Shuai LI | Qibin Liu | Xiaoxue Li | Zhuoyu Li | Yangyu Tao | Fei Gao | Cheng Jiang | Bochao Wang | Kai Liu | Jianchen Zhu | Wai Lam | Bo Zhou | Di Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) is largely driven by scaling training compute through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former contributes to learning broad knowledge and skills from general data, while struggling with data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data and RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting next text segments conditioned on the prefix text. Experiments across multiple benchmarks and models demonstrate the effectiveness of RLPT. For example, RLPT yields substantial improvements in continual pre-training (+4.6%) and provides a strong foundation for post-training (+3.4%) on Qwen3-8B-Base.
Co-authors
- Xue Gong 2
- Guanhua Huang 2
- Yuhao Jiang 2
- Siheng Li 2
- Kejiao Li 2
- Ruibin Xiong 2
- Qi Yi 2
- Bo Zhou 2
- Zheng Fang 1
- Fei Gao 1
- Cheng Jiang 1
- Shuai LI 1
- Wai Lam 1
- Kun Li 1
- Xiaoxue Li 1
- Zhuoyu Li 1
- Qibin Liu 1
- Kai Liu 1
- Kun Shi 1
- Yangyu Tao 1
- Mingze Wang 1
- Bochao Wang 1
- Di Wang 1
- Haoyuan Wu 1
- Wujiajia 1
- Zenan Xu 1
- Guanghui Xu 1
- Jinbao Xue 1
- Jianfeng Yan 1
- Yuyuan Zeng 1
- Chenchen Zhang 1
- Zihao Zheng 1
- Jianchen Zhu 1
- Zhijiang xu 1