Qi Yi
2026
Reinforcement Learning on Pre-Training Data
Siheng Li | Kejiao Li | Zenan Xu | Guanhua Huang | Kun Li | Haoyuan Wu | Wujiajia | Zihao Zheng | Chenchen Zhang | Kun Shi | Xue Gong | Qi Yi | Ruibin Xiong | Tingqiang Xu | Yuhao Jiang | Jianfeng Yan | Yuyuan Zeng | Guanghui Xu | Jinbao Xue | Zhijiang xu | Zheng Fang | Shuai LI | Qibin Liu | Xiaoxue Li | Zhuoyu Li | Yangyu Tao | Fei Gao | Cheng Jiang | Bochao Wang | Kai Liu | Jianchen Zhu | Wai Lam | Bo Zhou | Di Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) is largely driven by scaling training compute through either pre-training with next-token prediction (NTP) or post-training with reinforcement learning (RL). The former contributes to learning broad knowledge and skills from general data, while struggling with data inefficiency and catastrophic forgetting in continual learning settings. The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data and RL. In particular, RLPT derives reward signals directly from general text data through a next-segment reasoning objective, rewarding the policy for correctly predicting next text segments conditioned on the prefix text. Experiments across multiple benchmarks and models demonstrate the effectiveness of RLPT. For example, RLPT yields substantial improvements in continual pre-training (+4.6%) and provides a strong foundation for post-training (+3.4%) on Qwen3-8B-Base.
AT²PO: Agentic Turn-based Policy Optimization via Tree Search
Zefang Zong | Dingwei Chen | Yang Li | Qi Yi | Bo Zhou | Chengming Li | BO Qian | Peng Chen | Jie Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT²PO (**A**gentic **T**urn-based **P**olicy **O**ptimization via **T**ree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT²PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization (ATPO), a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline of up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component.
Co-authors
- Bo Zhou 2
- Dingwei Chen 1
- Peng Chen 1
- Zheng Fang 1
- Fei Gao 1
- Xue Gong 1
- Guanhua Huang 1
- Cheng Jiang 1
- Jie Jiang 1
- Yuhao Jiang 1
- Shuai LI 1
- Wai Lam 1
- Chengming Li 1
- Kejiao Li 1
- Kun Li 1
- Siheng Li 1
- Xiaoxue Li 1
- Yang Li 1
- Zhuoyu Li 1
- Kai Liu 1
- Qibin Liu 1
- BO Qian 1
- Kun Shi 1
- Yangyu Tao 1
- Bochao Wang 1
- Di Wang 1
- Haoyuan Wu 1
- Wujiajia 1
- Ruibin Xiong 1
- Guanghui Xu 1
- Tingqiang Xu 1
- Zenan Xu 1
- Jinbao Xue 1
- Jianfeng Yan 1
- Yuyuan Zeng 1
- Chenchen Zhang 1
- Zihao Zheng 1
- Jianchen Zhu 1
- Zefang Zong 1
- Zhijiang xu 1
Venues
- ACL 2