Haijun Lv
2026
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Yuming Yang | Mingyoung Lai | Wanxu Zhao | Xiaoran Fan | Zhiheng Xi | Mingqi Wu | Chiyue Huang | Jun Zhao | Haijun Lv | Jian Tong | Yunhua Zhou | Yicheng Zou | Qipeng Guo | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model’s current behavior but overlooking more informative ones. To address this, we propose Rank–Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory’s average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
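The definition in the abstract above (average token-wise rank divided by average negative log-likelihood) can be illustrated with a minimal sketch. It assumes the student model's logits for every trajectory position are already available as an array; the function name `rank_surprisal_ratio`, the rank convention (rank 1 is the most probable token), and the absence of any rank clipping or normalization are assumptions made here for illustration and may differ from the paper's implementation.

```python
import math
import numpy as np

def rank_surprisal_ratio(logits, token_ids):
    """Sketch of RSR: mean token rank / mean negative log-likelihood.

    logits:    (T, V) student logits, one row per trajectory position (assumed input)
    token_ids: (T,) ids of the tokens actually present in the trajectory
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))

    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Surprisal of each trajectory token under the student.
    surprisal = -log_probs[positions, token_ids]

    # Rank of each trajectory token: 1 + number of tokens with a strictly higher logit.
    chosen = logits[positions, token_ids][:, None]
    ranks = 1 + (logits > chosen).sum(axis=1)

    return ranks.mean() / surprisal.mean()

# Toy example: 2 positions, vocabulary of 3 tokens.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
toy_tokens = [0, 1]
rsr = rank_surprisal_ratio(toy_logits, toy_tokens)
```

In practice the logits would come from a single forward pass of the student over the teacher's trajectory; how the resulting score is thresholded or compared across trajectories is left to the paper.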
Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design
Xu Guo | Qiming Ge | Jian Tong | Kedi Chen | Jin Zhang | Xiaogui Yang | Xuan Gao | Haijun Lv | Zhihui Lu | Yicheng Zou | Qipeng Guo
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.
2025
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Zhi Chen | Qiguang Chen | Libo Qin | Qipeng Guo | Haijun Lv | Yicheng Zou | Hang Yan | Kai Chen | Dahua Lin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in large language models (LLMs) with extended context windows have significantly improved various tasks. To improve long-context capabilities, much work focuses on augmenting LLMs’ capabilities with synthetic data. Existing methods often leverage the Self-Instruct framework to generate long-context instruction-tuning data. However, our preliminary experiments show that fewer than 35% of samples generated by Qwen-2-72B are multi-hop, and over 40% exhibit poor quality, limiting comprehensive understanding and further research. To address this, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, which integrates a quality verification agent, a single-hop question generation agent, a multiple question sampling strategy, and a multi-hop question merger agent. This framework significantly improves data quality, yielding data that is high-quality, multi-hop, and diverse. Furthermore, we conduct a thorough analysis of document selection, question merging, and validation techniques through extensive experiments across various models. Our results demonstrate that synthetic high-quality long-context instruction data can enhance model performance, surpassing even models trained on larger amounts of human-annotated data.
2024
AdaLomo: Low-memory Optimization with Adaptive Learning Rate
Kai Lv | Hang Yan | Qipeng Guo | Haijun Lv | Xipeng Qiu
Findings of the Association for Computational Linguistics: ACL 2024
Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter and, in theory, exhibits superior convergence performance compared to LOMO. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models. The code is accessible at https://github.com/OpenLMLab/LOMO.
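To illustrate how a factorized second-moment estimate saves memory, here is a minimal sketch in the style of Adafactor's rank-1 factorization, which keeps one accumulator per row and one per column instead of a full matrix. It is an assumption that this matches AdaLomo's exact update; the function name `factored_second_moment`, the decay `beta`, and the constant `eps` are illustrative, and grouped update normalization is not shown.

```python
import numpy as np

def factored_second_moment(grad, row_acc, col_acc, beta=0.999, eps=1e-30):
    """Rank-1 (row/column) estimate of the squared-gradient moving average.

    grad:    (m, n) gradient of one weight matrix
    row_acc: (m,) running row sums of squared gradients
    col_acc: (n,) running column sums of squared gradients
    Stores m + n numbers instead of m * n.
    """
    sq = np.asarray(grad, dtype=np.float64) ** 2 + eps

    # Exponential moving averages of the row and column sums.
    row_acc = beta * row_acc + (1.0 - beta) * sq.sum(axis=1)
    col_acc = beta * col_acc + (1.0 - beta) * sq.sum(axis=0)

    # Reconstruct the full (m, n) estimate only when it is needed for the update.
    v_hat = np.outer(row_acc, col_acc) / row_acc.sum()
    return v_hat, row_acc, col_acc

# Toy example: a gradient whose squared entries are exactly rank 1.
g = np.outer([1.0, 2.0], [1.0, 2.0, 3.0])
v_hat, r, c = factored_second_moment(g, np.zeros(2), np.zeros(3), beta=0.0)

# An adaptive per-parameter step would then scale the gradient elementwise.
update = g / np.sqrt(v_hat)
```

With `beta=0.0` and a rank-1 squared gradient the reconstruction is exact, so each entry of `update` has magnitude 1; for general gradients the factorization is only an approximation.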
Co-authors
- Qipeng Guo 4
- Yicheng Zou 3
- Jian Tong 2
- Hang Yan 2
- Kedi Chen 1
- Zhi Chen 1
- Qiguang Chen (陈麒光) 1
- Kai Chen 1
- Xiaoran Fan 1
- Xuan Gao 1
- Qiming Ge 1
- Tao Gui 1
- Xu Guo 1
- Chiyue Huang 1
- Xuan-Jing Huang (黄萱菁) 1
- Mingyoung Lai 1
- Dahua Lin 1
- Zhihui Lu 1
- Kai Lv 1
- Libo Qin 1
- Xipeng Qiu (邱锡鹏) 1
- Mingqi Wu 1
- Zhiheng Xi 1
- Yuming Yang 1
- Xiaogui Yang 1
- Qi Zhang 1
- Jin Zhang 1
- Wanxu Zhao 1
- Jun Zhao 1
- Yunhua Zhou 1