Abstract
In this work, we propose a novel framework, Gradient Aligned Mutual Learning BERT (GAML-BERT), for improving the early exiting of BERT. GAML-BERT's contributions are two-fold. First, we conduct a set of pilot experiments which show that mutual knowledge distillation between a shallow exit and a deep exit leads to better performance for both. Based on this observation, we use mutual learning to improve BERT's early exiting performance, that is, we ask each exit of a multi-exit BERT to distill knowledge from the others. Second, we propose GA, a novel training method that aligns the gradients from the knowledge distillation loss with those from the cross-entropy loss. Extensive experiments are conducted on the GLUE benchmark, which show that GAML-BERT significantly outperforms the state-of-the-art (SOTA) BERT early exiting methods.
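The abstract does not spell out implementation details, so the following is only a minimal, hedged sketch of the two ideas it names: mutual knowledge distillation among the exits of a multi-exit classifier, and a gradient-alignment step that reconciles the distillation gradient with the cross-entropy gradient. The function names (`mutual_kd_loss`, `gradient_aligned_step`), the temperature value, and the per-tensor cosine-similarity gate are illustrative assumptions, not necessarily the paper's exact GA rule.

```python
import torch
import torch.nn.functional as F

def mutual_kd_loss(exit_logits, temperature=2.0):
    """Each exit distills from every other exit; soft targets are detached
    so each exit only acts as a teacher, not a gradient path."""
    loss, n = 0.0, len(exit_logits)
    for i, student in enumerate(exit_logits):
        for j, teacher in enumerate(exit_logits):
            if i == j:
                continue
            loss = loss + F.kl_div(
                F.log_softmax(student / temperature, dim=-1),
                F.softmax(teacher.detach() / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
    return loss / (n * (n - 1))

def gradient_aligned_step(model, ce_loss, kd_loss, optimizer):
    """Illustrative alignment rule (an assumption): keep the KD gradient for a
    parameter tensor only when it does not conflict with the CE gradient,
    judged by cosine similarity."""
    params = [p for p in model.parameters() if p.requires_grad]
    ce_grads = torch.autograd.grad(ce_loss, params, retain_graph=True, allow_unused=True)
    kd_grads = torch.autograd.grad(kd_loss, params, allow_unused=True)

    optimizer.zero_grad()
    for p, g_ce, g_kd in zip(params, ce_grads, kd_grads):
        g = torch.zeros_like(p)
        if g_ce is not None:
            g = g + g_ce
        if g_kd is not None:
            if g_ce is None or torch.cosine_similarity(
                g_kd.flatten(), g_ce.flatten(), dim=0
            ) > 0:
                # add the KD gradient only when it points in a direction
                # compatible with the task (CE) gradient
                g = g + g_kd
        p.grad = g
    optimizer.step()
```

In use, one would compute the per-exit cross-entropy losses and `mutual_kd_loss` from the list of exit logits in a single forward pass, then call `gradient_aligned_step` instead of a plain `loss.backward(); optimizer.step()`.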
- Anthology ID: 2021.emnlp-main.242
- Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2021
- Address: Online and Punta Cana, Dominican Republic
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 3033–3044
- URL: https://aclanthology.org/2021.emnlp-main.242
- DOI: 10.18653/v1/2021.emnlp-main.242
- Cite (ACL): Wei Zhu, Xiaoling Wang, Yuan Ni, and Guotong Xie. 2021. GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3033–3044, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning (Zhu et al., EMNLP 2021)
- PDF: https://preview.aclanthology.org/remove-xml-comments/2021.emnlp-main.242.pdf
- Data: GLUE