Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation

Bowen Wu, Huan Zhang, MengYuan Li, Zongsheng Wang, Qihang Feng, Junhong Huang, Baoxun Wang


Abstract
Recently, BERT has become an essential ingredient of various NLP deep models due to its effectiveness and universal usability. However, the online deployment of BERT is often blocked by its large number of parameters and high computational cost. Many studies have shown that knowledge distillation is effective at transferring knowledge from BERT into a model with fewer parameters. Nevertheless, current BERT distillation approaches mainly focus on task-specific distillation; such methodologies lose the general semantic knowledge that gives BERT its universal usability. In this paper, we propose a distillation framework oriented toward sentence representation approximation that can distill the pre-trained BERT into a simple LSTM-based model without specifying tasks. Consistent with BERT, our distilled model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task. In addition, our model can be combined with task-specific distillation procedures. The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods and even much larger models, namely ELMo, with substantially improved efficiency.
Anthology ID:
2020.aacl-main.9
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
70–79
URL:
https://aclanthology.org/2020.aacl-main.9
Cite (ACL):
Bowen Wu, Huan Zhang, MengYuan Li, Zongsheng Wang, Qihang Feng, Junhong Huang, and Baoxun Wang. 2020. Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 70–79, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation (Wu et al., AACL 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.aacl-main.9.pdf
Data
GLUE, MRPC, MultiNLI, SST