2025
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Liang Wen | Yunke Cai | Fenrui Xiao | Xin He | Qi An | Zhenyu Duan | Yimin Du | Junchen Liu | Tanglifu Tanglifu | Xiaowei Lv | Haosheng Zou | Yongchao Deng | Shousheng Jia | Xiangzheng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
This paper introduces Light-R1, an open-source suite for training long reasoning models using a reproducible and cost-effective methodology. Given the proprietary nature of the data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available.
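The abstract describes curriculum training that progressively increases data difficulty across post-training stages. As an illustration only, here is a minimal sketch of difficulty-based curriculum staging. It assumes difficulty is estimated by a per-example pass rate (how often a reference model solves the problem); the names `Example`, `pass_rate`, the thresholds, and the placeholder `sft()` trainer are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    prompt: str
    solution: str
    pass_rate: float  # fraction of sampled attempts that solve it; lower = harder


def stage_filter(data: List[Example], max_pass_rate: float) -> List[Example]:
    """Keep only examples at or below a difficulty threshold."""
    return [ex for ex in data if ex.pass_rate <= max_pass_rate]


def curriculum_stages(data: List[Example]) -> List[List[Example]]:
    """Build progressively harder training pools for sequential SFT passes."""
    # Stage 1: a broad pool of moderately hard problems.
    stage1 = stage_filter(data, max_pass_rate=0.5)
    # Stage 2: only the hardest subset, e.g. a few thousand rarely-solved examples.
    stage2 = stage_filter(data, max_pass_rate=0.1)
    return [stage1, stage2]


# Usage sketch: fine-tune one SFT pass per stage, then apply preference
# optimization / RL (e.g. DPO, GRPO) on top of the final SFT checkpoint.
# for stage_data in curriculum_stages(dataset):
#     model = sft(model, stage_data)  # sft() is a placeholder for any SFT trainer
```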