Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Tanglifu Tanglifu, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang


Abstract
This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available.
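As a rough illustration of the difficulty-based curriculum the abstract describes (progressively harder data across stages, with a small set of roughly 3,000 challenging examples reserved for the final stage), the following minimal Python sketch shows one way such staging could be implemented. The field names, thresholds, and pass-rate-based difficulty signal are assumptions for illustration only, not the paper's actual pipeline.

    # Hypothetical sketch: split a problem pool into two SFT stages of
    # increasing difficulty, using a model pass rate as the difficulty proxy.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Example:
        problem: str
        solution: str
        pass_rate: float  # fraction of sampled answers that were correct (assumed signal)

    def build_curriculum(pool: List[Example],
                         stage1_max_pass: float = 0.9,
                         stage2_max_pass: float = 0.3,
                         stage2_size: int = 3000) -> Tuple[List[Example], List[Example]]:
        # Stage 1: drop only trivially easy problems (very high pass rate).
        stage1 = [ex for ex in pool if ex.pass_rate <= stage1_max_pass]
        # Stage 2: keep the hardest problems (lowest pass rate), capped at a small set.
        hard = sorted((ex for ex in pool if ex.pass_rate <= stage2_max_pass),
                      key=lambda ex: ex.pass_rate)
        stage2 = hard[:stage2_size]
        return stage1, stage2

Subsequent DPO and GRPO stages would then be run on top of the stage-2 SFT checkpoint; this sketch only covers the data-selection step.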
Anthology ID:
2025.acl-industry.24
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Georg Rehm, Yunyao Li
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
318–327
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-industry.24/
Cite (ACL):
Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Tanglifu Tanglifu, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. 2025. Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 318–327, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond (Wen et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-industry.24.pdf