Haosheng Zou

2025

This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available.
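
The abstract names GRPO without describing it. As a minimal sketch, assuming the commonly described group-relative formulation (each sampled response is scored against the other responses drawn for the same prompt), the advantage computation might look like the following. The function name, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not details taken from Light-R1.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), with the
    population standard deviation. `eps` guards against division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards: 1.0 for a correct final answer, 0.0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With binary correctness rewards, correct responses in a mixed group get an advantage of roughly +1 and incorrect ones roughly -1, while a group in which every response is right (or every response is wrong) yields all-zero advantages and so contributes no learning signal.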

Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT’s benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this, we propose a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. This protocol evaluates both answer correctness and process reliability, with the latter decomposed into source faithfulness and intrinsic consistency components for efficient and accurate assessment. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models will be released upon acceptance.
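
The filtering logic implied by this protocol can be sketched as follows. This is a hedged illustration only: the `ReasoningPath` type, the exact-substring faithfulness check, the case-insensitive answer match, and the `is_consistent` callback (standing in for whatever consistency judge is used) are all hypothetical and are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReasoningPath:
    text: str               # the self-sampled reasoning chain
    answer: str             # final answer extracted from the chain
    cited_spans: List[str]  # excerpts the chain claims to quote from the context


def is_source_faithful(path: ReasoningPath, context: str) -> bool:
    # Assumption: faithfulness is approximated by requiring every quoted
    # excerpt to appear verbatim in the long input context.
    return all(span in context for span in path.cited_spans)


def select_paths(
    paths: List[ReasoningPath],
    context: str,
    gold_answer: str,
    is_consistent: Callable[[ReasoningPath], bool],
) -> List[ReasoningPath]:
    """Keep paths that pass answer correctness and both reliability checks."""
    kept = []
    for path in paths:
        answer_ok = path.answer.strip().lower() == gold_answer.strip().lower()
        if answer_ok and is_source_faithful(path, context) and is_consistent(path):
            kept.append(path)
    return kept
```

Separating the faithfulness check from the consistency check mirrors the decomposition in the abstract: the first can be a cheap comparison against the source, leaving the more expensive judgment to the second.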