ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Seongryong Jung, Suwan Yoon, DongGeon Kim, Hwanhee Lee


Abstract
Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD approaches such as Forward KL (FKL) and Reverse KL (RKL) apply a uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines that use uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi's effectiveness and practicality.
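The abstract describes ToDi's core mechanism: a per-token sigmoid weight computed from the teacher-student log-probability ratio that blends FKL and RKL. The following PyTorch snippet is a minimal sketch of that idea based only on the abstract; the function name `todi_loss`, the numerical details, and the exact normalization are assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def todi_loss(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Sketch of token-wise FKL/RKL blending (shapes: batch x seq_len x vocab).

    Assumption: the weight is sigmoid(log p - log q), following the abstract's
    description of a sigmoid over the teacher-student probability log-ratio.
    """
    log_p = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probabilities
    log_q = F.log_softmax(student_logits, dim=-1)   # student log-probabilities
    p, q = log_p.exp(), log_q.exp()

    # Weight is large where the student underestimates a token (favor FKL,
    # which boosts it) and small where it overestimates (favor RKL, which
    # suppresses it).
    alpha = torch.sigmoid(log_p - log_q)

    fkl = p * (log_p - log_q)   # forward KL summand per vocabulary entry
    rkl = q * (log_q - log_p)   # reverse KL summand per vocabulary entry

    per_position = (alpha * fkl + (1.0 - alpha) * rkl).sum(dim=-1)
    return per_position.mean()
```

Used as a drop-in distillation objective, this reduces to pure FKL when `alpha` is 1 everywhere and pure RKL when it is 0, so uniform strategies are special cases of the weighted combination.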
Anthology ID:
2025.emnlp-main.409
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8089–8102
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.409/
Cite (ACL):
Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. 2025. ToDi: Token-wise Distillation via Fine-Grained Divergence Control. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8089–8102, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
ToDi: Token-wise Distillation via Fine-Grained Divergence Control (Jung et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.409.pdf
Checklist:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.409.checklist.pdf