GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model

Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng Zhang, Jie Tang


Abstract
Currently, reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: such applications require complex distillation methods applied to even larger-scale PLMs (over 10B parameters), while being constrained by GPU memory and the cost of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation of larger-scale PLMs with various distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch and combine different distillation methods within a single framework. Experimental results show that GKD supports the distillation of at least 100B-scale PLMs and 25 mainstream distillation methods on 8 NVIDIA A100 (40GB) GPUs.
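For readers unfamiliar with the distillation objectives such frameworks wrap, the sketch below shows a standard response-based knowledge distillation loss (temperature-softened KL divergence between teacher and student logits plus a hard-label term). This is a minimal illustration only, not GKD's actual API; the temperature, alpha, and tensor shapes are assumptions made for the example.

```python
# Minimal sketch of a vanilla response-based knowledge distillation step
# (illustrative only; not GKD's interface).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend the soft-label KL distillation loss with the hard-label cross-entropy."""
    # Soften both distributions with the temperature, then match them with KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Usage with illustrative shapes: a batch of 4 examples over a 30k-token vocabulary.
student_logits = torch.randn(4, 30000, requires_grad=True)
teacher_logits = torch.randn(4, 30000)  # the teacher would normally run under torch.no_grad()
labels = torch.randint(0, 30000, (4,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```

More elaborate methods (feature-based or relation-based distillation, as among the 25 methods the paper evaluates) replace or augment this loss, which is precisely the switching and combining that GKD is designed to make easy.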
Anthology ID:
2023.acl-industry.15
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Sunayana Sitaram, Beata Beigman Klebanov, Jason D Williams
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
134–148
URL:
https://aclanthology.org/2023.acl-industry.15
DOI:
10.18653/v1/2023.acl-industry.15
Cite (ACL):
Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng Zhang, and Jie Tang. 2023. GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 134–148, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model (Tan et al., ACL 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2023.acl-industry.15.pdf