Towards A Better Initial Policy Model For Scalable Long-CoT Reinforcement Learning

Bofei Gao, Yejie Wang, Yibo Miao, Ruoyu Wu, Feifan Song, Longhui Yu, Tianyu Liu, Baobao Chang

Abstract
Long-CoT reasoning combined with reinforcement learning for large language models demonstrates remarkable performance and scalability. However, we observe that the initial policy model significantly influences both the final performance and the token efficiency, and systematic guidelines for obtaining a better initial policy model are lacking. To bridge this gap, we conduct a comprehensive investigation, activating initial models with a variety of datasets that differ in data volume and reasoning pattern. We then thoroughly analyze and compare the RL process across these initial models from the perspectives of upper bounds, diversity, and token efficiency, yielding deeper understanding of and insights into long-CoT RL. Based on our empirical results, we propose systematic guidelines and a novel Re-RFT method for constructing a better RL starting point. Our experimental results with a 14B model surpass DeepSeek-R1-Distill-Qwen-14B by an average of 4.6%, demonstrating the effectiveness and superiority of our approach.
Anthology ID: 2025.findings-acl.397
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7652–7665
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.397/
DOI: 10.18653/v1/2025.findings-acl.397
Cite (ACL): Bofei Gao, Yejie Wang, Yibo Miao, Ruoyu Wu, Feifan Song, Longhui Yu, Tianyu Liu, and Baobao Chang. 2025. Towards A Better Initial Policy Model For Scalable Long-CoT Reinforcement Learning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7652–7665, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Towards A Better Initial Policy Model For Scalable Long-CoT Reinforcement Learning (Gao et al., Findings 2025)
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.397.pdf