Juan Yeo
2026
Stable On-Policy Distillation through Adaptive Target Reformulation
Ijun Jang | Jewon Yeom | Juan Yeo | Hyunggyu Lim | Taesup Kim
Findings of the Association for Computational Linguistics: ACL 2026
Knowledge distillation (KD) is a widely adopted technique for transferring capabilities from large language models to smaller student models. However, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities and noisy teacher feedback during early optimization stages. These challenges manifest as pathological gradients in forward KL objectives when students encounter unfamiliar tokens, or as a collapse in distributional diversity within reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric target distribution in logit space to emphasize agreement between the teacher and the student. By introducing a tunable parameter 𝛽, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.