Ruoyu Wu


2025

Towards A Better Initial Policy Model For Scalable Long-CoT Reinforcement Learning
Bofei Gao | Yejie Wang | Yibo Miao | Ruoyu Wu | Feifan Song | Longhui Yu | Tianyu Liu | Baobao Chang
Findings of the Association for Computational Linguistics: ACL 2025

Long-CoT reasoning combined with reinforcement learning for large language models demonstrates remarkable performance and scalability. However, we observe that the initial policy model can significantly influence both the final performance and the token efficiency. Moreover, systematic guidelines for obtaining a better initial policy model are lacking. To bridge this gap, we conduct a comprehensive investigation by activating the initial model using a variety of datasets with different data volumes and reasoning patterns. We then thoroughly analyze and compare the RL process for different initial models from the perspectives of upper bounds, diversity, and token efficiency, providing deeper insight into long-CoT RL. Based on our empirical results, we propose a systematic guideline and a novel Re-RFT method for constructing a better RL starting point. Our experimental results with the 14B model surpass DeepSeek-R1-Distill-Qwen-14B by an average of 4.6%, demonstrating the effectiveness of our approach.