Knowledge Distillation for Language Models

Yuqiao Wen, Freda Shi, Lili Mou


Abstract
Knowledge distillation (KD) aims to transfer the knowledge of a teacher (usually a large model) to a student (usually a small one). In this tutorial, our goal is to provide participants with a comprehensive understanding of the techniques and applications of KD for language models. After introducing the basic concepts, including intermediate-layer matching and prediction matching, we will present advanced techniques such as reinforcement learning-based KD and multi-teacher distillation. For applications, we will focus on KD for large language models (LLMs), covering topics ranging from LLM sequence compression to LLM self-distillation. The target audience is expected to know the basics of machine learning and NLP, but does not need to be familiar with the details of mathematical derivations or neural models.
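As a concrete illustration of the prediction-matching idea mentioned in the abstract, the sketch below trains a student to match a teacher's temperature-softened output distribution via a KL-divergence term, combined with the usual cross-entropy on gold labels. This is a minimal, generic sketch rather than the tutorial's own code; the function name kd_loss and the hyperparameters temperature and alpha are illustrative assumptions.

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
        # Prediction matching: soften both output distributions with a temperature
        # and push the student's distribution toward the teacher's via KL divergence.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        distill = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

        # Standard supervised cross-entropy on the gold labels (logits are [batch, vocab],
        # labels are [batch]; for sequences, flatten time into the batch dimension).
        supervised = F.cross_entropy(student_logits, labels)

        # alpha trades off imitating the teacher against fitting the data directly.
        return alpha * distill + (1 - alpha) * supervised

The temperature lets the student learn from the teacher's relative preferences over non-argmax tokens, and the temperature-squared factor keeps the gradient scale of the distillation term comparable across temperature settings.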
Anthology ID:
2025.naacl-tutorial.4
Volume:
Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Maria Lomeli, Swabha Swayamdipta, Rui Zhang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
25–29
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-tutorial.4/
Cite (ACL):
Yuqiao Wen, Freda Shi, and Lili Mou. 2025. Knowledge Distillation for Language Models. In Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts), pages 25–29, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Knowledge Distillation for Language Models (Wen et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-tutorial.4.pdf