Stable On-Policy Distillation through Adaptive Target Reformulation

Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggyu Lim, Taesup Kim


Abstract
Knowledge distillation (KD) is a widely adopted technique for transferring capabilities from large language models to smaller student models. However, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities and noisy teacher feedback during early optimization stages. These challenges manifest as pathological gradients in forward KL objectives when students encounter unfamiliar tokens, or as a collapse in distributional diversity within reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric target distribution in logit space to emphasize agreement between the teacher and the student. By introducing a tunable parameter 𝛽, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
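The abstract does not state the target's formula, so the following is only an illustrative sketch of one way a geometric target in logit space could behave. It assumes a product-of-experts form, q ∝ p_teacher · p_student^β, which is a guess made for illustration and not the authors' definition. The function names `log_softmax` and `geometric_target` and the toy logits are likewise hypothetical. Under this assumption, β = 0 recovers the plain teacher distribution, while larger β shifts target mass away from tokens the student considers unlikely.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def geometric_target(teacher_logits, student_logits, beta):
    # ASSUMPTION (not from the paper): q ∝ p_teacher * p_student**beta.
    # Adding log-probabilities multiplies probabilities; the final
    # log_softmax renormalizes the product into a distribution.
    log_q = log_softmax(teacher_logits) + beta * log_softmax(student_logits)
    return np.exp(log_softmax(log_q))

# Toy vocabulary of three tokens; the student is very unsure about token 1.
teacher = np.array([2.0, 1.0, 0.0])
student = np.array([2.0, -3.0, 0.0])

q_beta0 = geometric_target(teacher, student, beta=0.0)  # plain teacher
q_beta1 = geometric_target(teacher, student, beta=1.0)  # agreement-weighted
print(q_beta0, q_beta1)
```

In this toy case the target mass on token 1, where the student has low confidence, shrinks as β grows, and mass concentrates on the token both models favor. That loosely mirrors the "veto" and "decisiveness" roles the abstract attributes to β, but the paper itself should be consulted for the actual objective.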
Anthology ID:
2026.findings-acl.2094
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
42217–42227
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2094/
Cite (ACL):
Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggyu Lim, and Taesup Kim. 2026. Stable On-Policy Distillation through Adaptive Target Reformulation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42217–42227, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Stable On-Policy Distillation through Adaptive Target Reformulation (Jang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2094.pdf
Checklist:
2026.findings-acl.2094.checklist.pdf