TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu


Abstract
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT. TinyBERT4 with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% of their parameters and ~31% of their inference time. Moreover, TinyBERT6 with 6 layers performs on par with its teacher BERT-Base.
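The layer-to-layer Transformer distillation described above can be illustrated with a minimal PyTorch-style sketch: each student layer is trained to mimic the attention matrices and hidden states of a mapped teacher layer, with a learned linear projection bridging the smaller student hidden size. This is a sketch under stated assumptions; the class and variable names are illustrative and do not reflect the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerDistillationLoss(nn.Module):
    """Sketch of layer-to-layer Transformer distillation (hypothetical names).

    Each student layer is paired with a teacher layer (e.g. student layer i
    mimics teacher layer i * (teacher_layers // student_layers)). The student
    matches the teacher's attention matrices and hidden states; a learned
    projection maps student hidden states into the teacher's hidden space.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projects student hidden states to the teacher's hidden size.
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_attn, teacher_attn, student_hidden, teacher_hidden):
        # student_attn / teacher_attn: lists of [batch, heads, seq, seq] tensors,
        # one per mapped layer; hidden states are [batch, seq, dim] tensors.
        attn_loss = sum(F.mse_loss(s, t) for s, t in zip(student_attn, teacher_attn))
        hidden_loss = sum(
            F.mse_loss(self.proj(s), t) for s, t in zip(student_hidden, teacher_hidden)
        )
        return attn_loss + hidden_loss

In the two-stage framework, an objective of this form would be applied first on general-domain pre-training data and then again on (augmented) task-specific data, before a final prediction-layer distillation step on the task labels.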
Anthology ID: 2020.findings-emnlp.372
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Editors: Trevor Cohn, Yulan He, Yang Liu
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4163–4174
URL: https://aclanthology.org/2020.findings-emnlp.372
DOI: 10.18653/v1/2020.findings-emnlp.372
Cite (ACL): Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.
Cite (Informal): TinyBERT: Distilling BERT for Natural Language Understanding (Jiao et al., Findings 2020)
PDF: https://preview.aclanthology.org/naacl24-info/2020.findings-emnlp.372.pdf
Optional supplementary material: 2020.findings-emnlp.372.OptionalSupplementaryMaterial.zip
Code: huawei-noah/Pretrained-Language-Model (+ additional community code)
Data: CoLA, GLUE, MRPC, MultiNLI, QNLI, Quora Question Pairs, RTE, SQuAD, SST, SST-2, STS Benchmark