Bin Wen
2026
Why Can Distillation Work with Limited Resources? A Systematic Study
Xiao Hu | Xingyu Lu | Liyuan Mao | YiFan Zhang | Tianke Zhang | Bin Wen | Fan Yang | Tingting Gao | Guorui Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models have recently made remarkable progress in reasoning, largely driven by scaling data and model size. In parallel, several studies argue that for smaller models, high-quality distillation can yield strong reasoning performance with minimal resources. However, a framework for understanding machine reasoning that explains why low-resource distillation can boost model performance is still missing. In this paper, we conduct a controlled case study: using fewer than 920 examples, simple distillation applied to the base model notably improves reasoning performance over both the base model and even zero-RL models. By analyzing token frequencies in model outputs, we find that the distilled model reasons more flexibly: it uses anthropomorphic tokens and logical connectors much more often than the base and zero-RL models. Further analysis reveals that distillation increases the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting, and Metacognitive Awareness. Frequent occurrences of these two behaviors give rise to flexible reasoning, which is essential for solving reasoning problems.
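The token-frequency comparison mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the marker sets `ANTHROPOMORPHIC` and `CONNECTORS`, the regex tokenizer, and the helper `rate_per_1000` are all assumptions made for this example, since the abstract does not specify the token lists or the tokenization used.

```python
import re
from collections import Counter

# Hypothetical marker sets, chosen only for illustration; the paper's
# actual token lists are not given in the abstract.
ANTHROPOMORPHIC = {"wait", "hmm", "okay", "oh"}
CONNECTORS = {"but", "so", "therefore", "because", "alternatively", "however"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def rate_per_1000(outputs, vocab):
    """Return how many tokens from `vocab` occur per 1000 tokens across `outputs`."""
    counts = Counter()
    total = 0
    for text in outputs:
        tokens = tokenize(text)
        total += len(tokens)
        counts.update(t for t in tokens if t in vocab)
    if total == 0:
        return 0.0
    return 1000.0 * sum(counts.values()) / total
```

Under these assumptions, calling `rate_per_1000` on the outputs of the distilled, base, and zero-RL models with the same marker set would give directly comparable rates, normalized for output length.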