Jiajie Zhang
2022
Knowledge Inheritance for Pre-trained Language Models
Yujia Qin | Yankai Lin | Jing Yi | Jiajie Zhang | Xu Han | Zhengyan Zhang | Yusheng Su | Zhiyuan Liu | Peng Li | Maosong Sun | Jie Zhou
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Recent explorations of large-scale pre-trained language models (PLMs) have revealed the power of PLMs with huge amounts of parameters, setting off a wave of training ever-larger PLMs. However, training a large-scale PLM requires tremendous computational resources, which may be practically unaffordable. In addition, existing large-scale PLMs are mainly trained from scratch individually, ignoring that many well-trained PLMs are already available. To this end, we explore how existing PLMs could benefit the training of larger PLMs in the future. Specifically, we introduce a pre-training framework named “knowledge inheritance” (KI) and explore how knowledge distillation can serve as auxiliary supervision during pre-training to efficiently learn larger PLMs. Experimental results demonstrate the superiority of KI in training efficiency. We also conduct empirical analyses to explore the effects of teacher PLMs’ pre-training settings, including model architecture, pre-training data, etc. Finally, we show that KI can be applied to domain adaptation and knowledge transfer.
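The core mechanism described in the abstract, using a smaller well-trained teacher PLM to provide auxiliary distillation supervision while the larger student is pre-trained, can be sketched roughly as below. This is a minimal illustration, not the paper's exact formulation: it assumes HuggingFace-style model outputs, and `ki_step`, `alpha`, and `temperature` are hypothetical names and hyperparameters.

```python
import torch
import torch.nn.functional as F

def ki_step(student, teacher, batch, alpha=0.5, temperature=2.0):
    """One training step mixing self-supervision with teacher distillation (sketch)."""
    student_out = student(**batch)           # assumes `batch` contains masked inputs and labels
    with torch.no_grad():
        teacher_out = teacher(**batch)       # teacher is frozen; no gradients needed

    # Standard masked language modeling loss (the student's own self-supervision).
    mlm_loss = student_out.loss

    # Knowledge distillation: match the teacher's softened token distributions.
    s_logits = student_out.logits / temperature
    t_logits = teacher_out.logits / temperature
    kd_loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Auxiliary supervision: interpolate the two objectives.
    return alpha * kd_loss + (1 - alpha) * mlm_loss
```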
ELLE: Efficient Lifelong Pre-training for Emerging Data
Yujia Qin | Jiajie Zhang | Yankai Lin | Zhiyuan Liu | Peng Li | Maosong Sun | Jie Zhou
Findings of the Association for Computational Linguistics: ACL 2022
Current pre-trained language models (PLMs) are typically trained on static data, ignoring that in real-world scenarios, streaming data from various sources may continuously grow. This requires PLMs to integrate information from all sources in a lifelong manner. Although this goal could be achieved by exhaustive pre-training on all the existing data, such a process is known to be computationally expensive. To this end, we propose ELLE, which aims at efficient lifelong pre-training for emerging data. Specifically, ELLE consists of (1) function-preserved model expansion, which flexibly expands an existing PLM’s width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks. We evaluate ELLE with streaming data from 5 domains on BERT and GPT. The results show the superiority of ELLE over various lifelong learning baselines in both pre-training efficiency and downstream performance. The code is publicly available at https://github.com/thunlp/ELLE.
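To make the "function preserved model expansion" idea concrete, the sketch below widens a two-layer feed-forward block without changing its input-output function, in the spirit of Net2Net-style widening. The helper `widen_ffn`, its arguments, and the duplicate-and-rescale initialization are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def widen_ffn(old_fc1: nn.Linear, old_fc2: nn.Linear, new_dim: int):
    """Grow the hidden width of an fc1 -> activation -> fc2 block from
    old_fc1.out_features to new_dim while preserving the block's function (sketch)."""
    old_dim = old_fc1.out_features
    # Keep every existing hidden unit, then randomly duplicate units to fill the new width.
    mapping = torch.cat([torch.arange(old_dim),
                         torch.randint(0, old_dim, (new_dim - old_dim,))])

    new_fc1 = nn.Linear(old_fc1.in_features, new_dim)
    new_fc2 = nn.Linear(new_dim, old_fc2.out_features)

    with torch.no_grad():
        # First layer: copy (and duplicate) the selected rows, so each new hidden
        # unit computes the same pre-activation as the unit it was copied from.
        new_fc1.weight.copy_(old_fc1.weight[mapping])
        new_fc1.bias.copy_(old_fc1.bias[mapping])

        # Second layer: divide each duplicated column by its replication count,
        # so the summed contribution of the copies equals the original output.
        counts = torch.bincount(mapping, minlength=old_dim).float()
        new_fc2.weight.copy_(old_fc2.weight[:, mapping] / counts[mapping])
        new_fc2.bias.copy_(old_fc2.bias)

    return new_fc1, new_fc2
```

Because the duplicated units are rescaled on the output side, the expanded block initially behaves exactly like the original PLM and then has extra capacity to absorb the new domain's data.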