Limitations of Knowledge Distillation for Zero-shot Transfer Learning

Saleh Soltan, Haidar Khan, Wael Hamza


Abstract
Pretrained transformer-based encoders such as BERT have been demonstrated to achieve state-of-the-art performance on numerous NLP tasks. Despite their success, BERT-style encoders are large and have high inference latency (especially on CPU machines), which makes them unappealing for many online applications. Recently introduced compression and distillation methods provide effective ways to alleviate this shortcoming; however, these works have focused mainly on monolingual encoders. Motivated by recent successes in zero-shot cross-lingual transfer learning with multilingual pretrained encoders such as mBERT, we evaluate the effectiveness of Knowledge Distillation (KD) during both the pretraining stage and the fine-tuning stage of multilingual BERT models. We show that, in contrast to previous observations for monolingual distillation, in multilingual settings distillation during pretraining is more effective than distillation during fine-tuning for zero-shot transfer learning. Moreover, we observe that distillation during fine-tuning may hurt zero-shot cross-lingual performance. Finally, we demonstrate that distilling a larger teacher (BERT Large) yields the strongest distilled model, performing best both on the source language and on target languages in zero-shot settings.
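For readers unfamiliar with the knowledge-distillation objective the abstract refers to, the sketch below shows the standard soft-label KD loss (teacher/student logit matching with a temperature-scaled KL term plus a hard-label cross-entropy term). This is a generic illustration, not the paper's exact training recipe; the temperature T and mixing weight alpha are illustrative hyperparameters, not values taken from the paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled teacher and student
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

The same loss form can be applied either during pretraining (with masked-language-model logits) or during task fine-tuning (with classifier logits), which are the two settings the paper compares.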
Anthology ID:
2021.sustainlp-1.3
Volume:
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
Month:
November
Year:
2021
Address:
Virtual
Editors:
Nafise Sadat Moosavi, Iryna Gurevych, Angela Fan, Thomas Wolf, Yufang Hou, Ana Marasović, Sujith Ravi
Venue:
sustainlp
Publisher:
Association for Computational Linguistics
Pages:
22–31
URL:
https://aclanthology.org/2021.sustainlp-1.3
DOI:
10.18653/v1/2021.sustainlp-1.3
Cite (ACL):
Saleh Soltan, Haidar Khan, and Wael Hamza. 2021. Limitations of Knowledge Distillation for Zero-shot Transfer Learning. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 22–31, Virtual. Association for Computational Linguistics.
Cite (Informal):
Limitations of Knowledge Distillation for Zero-shot Transfer Learning (Soltan et al., sustainlp 2021)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2021.sustainlp-1.3.pdf
Video:
https://preview.aclanthology.org/naacl-24-ws-corrections/2021.sustainlp-1.3.mp4
Data:
MTOP, PAWS-X, XNLI