2025
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
Huifeng Yin | Yu Zhao | Minghao Wu | Xuanfan Ni | Bo Zeng | Huaiyu.wh | Tianqi Shi | Liangying Shao | Chenyang Lyu | Longyue Wang | Weihua Luo | Kaifu Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation, i.e., post-training on LRM-generated data, is a straightforward yet effective way to enhance the reasoning abilities of smaller models, but it faces a critical bottleneck: we find that distilled long CoT data poses learning difficulties for small models and leads to the inheritance of biases (i.e., formalistic long-time thinking) under Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To alleviate this bottleneck, we propose constructing the training data from scratch using Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and a Joint Post-training Objective, to enhance SFT and RL on the MCTS data. We evaluate on benchmarks covering math (GSM8K, MATH, AIME), instruction following (Multi-IF), and planning (Blocksworld); the results demonstrate that our CoT-aware approaches substantially improve the reasoning performance of distilled models over standard distillation by reducing hallucinations in long-time thinking.
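For illustration only (the abstract does not specify the implementation), a step-level "fine-grained" variant of the standard DPO objective could look like the sketch below. The function name, tensor arguments, and the beta coefficient are assumptions for this sketch, not details taken from the paper.

# Minimal sketch of a step-level ("fine-grained") DPO objective, assuming each
# tensor entry holds the summed log-probability of one reasoning step under the
# policy or the frozen reference model. Names are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(policy_chosen_logps: torch.Tensor,
                      policy_rejected_logps: torch.Tensor,
                      ref_chosen_logps: torch.Tensor,
                      ref_rejected_logps: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    # Implicit per-step rewards, measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Standard DPO logistic loss, averaged over steps rather than whole responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with 4 preference pairs of reasoning steps.
if __name__ == "__main__":
    pc, pr = torch.randn(4), torch.randn(4)
    rc, rr = torch.randn(4), torch.randn(4)
    print(stepwise_dpo_loss(pc, pr, rc, rr))

Compared to sequence-level DPO, applying the preference loss per reasoning step lets the trainer penalize individual low-quality steps inside an otherwise acceptable long CoT, which is one plausible reading of "fine-grained" here.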