@inproceedings{tai-etal-2020-exbert,
title = "ex{BERT}: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources",
author = "Tai, Wen and
Kung, H. T. and
Dong, Xin and
Comiter, Marcus and
Kuo, Chang-Fu",
editor = "Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2020.findings-emnlp.129/",
doi = "10.18653/v1/2020.findings-emnlp.129",
pages = "1433--1439",
abstract = "We introduce exBERT, a training method to extend BERT pre-trained models from a general domain to a new pre-trained model for a specific domain with a new additive vocabulary under constrained training resources (i.e., constrained computation and data). exBERT uses a small extension module to learn to adapt an augmenting embedding for the new domain in the context of the original BERT{'}s embedding of a general vocabulary. The exBERT training method is novel in learning the new vocabulary and the extension module while keeping the weights of the original BERT model fixed, resulting in a substantial reduction in required training resources. We pre-train exBERT with biomedical articles from ClinicalKey and PubMed Central, and study its performance on biomedical downstream benchmark tasks using the MTL-Bioinformatics-2016 datasets. We demonstrate that exBERT consistently outperforms prior approaches when using limited corpus and pre-training computation resources."
}
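
To make the idea in the abstract concrete, here is a minimal sketch of the core mechanism it describes: a frozen general-domain embedding combined with a small trainable embedding for an additive domain-specific vocabulary via a lightweight extension module. The class, parameter names, and the sigmoid-gated mixing shown here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ExtensionEmbedding(nn.Module):
    """Illustrative sketch of the exBERT idea (assumptions, not the paper's code):
    the original BERT embedding stays frozen, while a small extension embedding
    for the new vocabulary and a gating module are trained."""

    def __init__(self, base_embedding: nn.Embedding, ext_vocab_size: int):
        super().__init__()
        hidden = base_embedding.embedding_dim
        self.base = base_embedding                        # original BERT embedding, kept fixed
        for p in self.base.parameters():
            p.requires_grad = False
        self.ext = nn.Embedding(ext_vocab_size, hidden)   # additive-vocabulary embedding (trainable)
        self.gate = nn.Linear(hidden, 1)                  # small extension module producing mixing weights

    def forward(self, base_ids: torch.Tensor, ext_ids: torch.Tensor) -> torch.Tensor:
        # base_ids: token ids under the original general-domain vocabulary
        # ext_ids:  token ids under the new domain-specific vocabulary
        e_base = self.base(base_ids)
        e_ext = self.ext(ext_ids)
        w = torch.sigmoid(self.gate(e_base))              # per-token mixing weight in (0, 1)
        return w * e_base + (1.0 - w) * e_ext
```

Because only the extension embedding and the gate receive gradients, the trainable parameter count stays small, which is what allows pre-training under constrained computation and data.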
Markdown (Informal)
[exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources](https://aclanthology.org/2020.findings-emnlp.129/) (Tai et al., Findings 2020)