Abstract
Knowledge distillation is an effective technique for compressing over-parameterized language models. In this work, we propose breaking the global feature distillation task into N local sub-tasks. In this framework, each neuron in the last hidden layer of the teacher network acts as a specialized sub-teacher, and each neuron in the last hidden layer of the student network acts as a focused sub-student. Each focused sub-student learns from its one corresponding specialized sub-teacher and ignores the others, which simplifies the sub-student's task and keeps it focused. Our proposed method is novel and can be combined with other distillation techniques. Empirical results show that our approach outperforms state-of-the-art methods, maintaining higher performance on most benchmark datasets. Furthermore, we propose a randomized variant, Masked One-to-One Mapping: rather than learning all N sub-tasks simultaneously, the student learns only a subset of them at each optimization step. This variant enables the student to digest the received flow of knowledge more effectively and yields superior results.
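To give a concrete picture of the idea, here is a minimal PyTorch-style sketch of a per-neuron (one-to-one) feature distillation loss with an optional random mask over the sub-tasks. The function name `one_to_one_distillation_loss`, the argument `mask_ratio`, and the mean-squared formulation are our own assumptions for illustration, not the exact loss from the paper; the sketch also assumes the student and teacher last hidden layers have the same width (or that the student features have already been projected to that width).

```python
import torch


def one_to_one_distillation_loss(student_hidden, teacher_hidden, mask_ratio=0.0):
    """Sketch of one-to-one feature distillation (not the paper's exact loss).

    student_hidden, teacher_hidden: [batch, hidden_dim] last-hidden-layer
    features of matching width. Student neuron i is matched only to teacher
    neuron i, so each of the N hidden dimensions becomes a local sub-task.
    With mask_ratio > 0, a random subset of sub-tasks is skipped at this
    optimization step (the "Masked One-to-One Mapping" variant).
    """
    # One local sub-task per hidden dimension: squared error averaged over the batch.
    per_neuron_loss = ((student_hidden - teacher_hidden.detach()) ** 2).mean(dim=0)  # [hidden_dim]

    if mask_ratio > 0.0:
        # Randomly keep only a fraction (1 - mask_ratio) of the sub-tasks this step.
        keep = (torch.rand_like(per_neuron_loss) >= mask_ratio).float()
        return (per_neuron_loss * keep).sum() / keep.sum().clamp(min=1.0)

    # Plain one-to-one mapping: average over all N sub-tasks.
    return per_neuron_loss.mean()
```

In practice this term would be added, with some weight, to the student's task loss (and possibly to a logit-distillation loss), and `mask_ratio` would control how many of the N sub-tasks are learned per step.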
- Anthology ID: 2023.findings-emnlp.882
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 13235–13245
- URL: https://aclanthology.org/2023.findings-emnlp.882
- DOI: 10.18653/v1/2023.findings-emnlp.882
- Cite (ACL): Khouloud Saadi, Jelena Mitrović, and Michael Granitzer. 2023. Learn From One Specialized Sub-Teacher: One-to-One Mapping for Feature-Based Knowledge Distillation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13235–13245, Singapore. Association for Computational Linguistics.
- Cite (Informal): Learn From One Specialized Sub-Teacher: One-to-One Mapping for Feature-Based Knowledge Distillation (Saadi et al., Findings 2023)
- PDF: https://preview.aclanthology.org/emnlp22-frontmatter/2023.findings-emnlp.882.pdf