RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation
Peng Lu, Abbas Ghaddar, Ahmad Rashid, Mehdi Rezagholizadeh, Ali Ghodsi, Philippe Langlais
Abstract
Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher's output. However, most existing works either fix the interpolation weight between the two losses a priori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, trained simultaneously with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.
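For concreteness, below is a minimal PyTorch sketch of the per-sample convex loss combination the abstract describes, assuming a standard cross-entropy label loss and a temperature-scaled KL distillation loss. The function name, signature, and the random stand-in for the meta-learner's per-sample weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rw_kd_loss(student_logits, teacher_logits, labels, alpha, temperature=1.0):
    """Per-sample convex combination of the label loss and the KD loss.

    alpha: tensor of shape (batch,) with one interpolation weight per sample,
    e.g. produced by a meta-learner (only stubbed in the usage example below).
    """
    # Cross-entropy against the gold labels, kept per-sample.
    ce = F.cross_entropy(student_logits, labels, reduction="none")

    # KL divergence between softened teacher and student distributions,
    # summed over classes so it is also one value per sample; the t*t
    # factor is the usual temperature scaling for distillation gradients.
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="none",
    ).sum(dim=-1) * (t * t)

    # Sample-wise convex combination instead of one fixed global weight.
    return (alpha * ce + (1.0 - alpha) * kd).mean()

# Usage sketch: alpha is a random stand-in for meta-learner outputs in [0, 1].
batch, num_classes = 8, 3
s = torch.randn(batch, num_classes, requires_grad=True)
t = torch.randn(batch, num_classes)
y = torch.randint(0, num_classes, (batch,))
alpha = torch.rand(batch)
loss = rw_kd_loss(s, t, y, alpha, temperature=2.0)
loss.backward()
```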
- Anthology ID: 2021.findings-emnlp.270
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
- Month: November
- Year: 2021
- Address: Punta Cana, Dominican Republic
- Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue: Findings
- SIG: SIGDAT
- Publisher: Association for Computational Linguistics
- Pages: 3145–3152
- URL: https://aclanthology.org/2021.findings-emnlp.270
- DOI: 10.18653/v1/2021.findings-emnlp.270
- Cite (ACL): Peng Lu, Abbas Ghaddar, Ahmad Rashid, Mehdi Rezagholizadeh, Ali Ghodsi, and Philippe Langlais. 2021. RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3145–3152, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation (Lu et al., Findings 2021)
- PDF: https://aclanthology.org/2021.findings-emnlp.270.pdf
- Data: GLUE, QNLI