@inproceedings{gao-etal-2024-kd,
title = "{VE}-{KD}: Vocabulary-Expansion Knowledge-Distillation for Training Smaller Domain-Specific Language Models",
author = "Gao, Pengju and
Yamasaki, Tomohiro and
Imoto, Kazunori",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.884/",
doi = "10.18653/v1/2024.findings-emnlp.884",
pages = "15046--15059",
abstract = "We propose VE-KD, a novel method that balances knowledge distillation and vocabulary expansion with the aim of training efficient domain-specific language models. Compared with traditional pre-training approaches, VE-KD exhibits competitive performance in downstream tasks while reducing model size and using fewer computational resources. Additionally, VE-KD refrains from overfitting in domain adaptation. Our experiments with different biomedical domain tasks demonstrate that VE-KD performs well compared with models such as BioBERT (+1{\%} at HoC) and PubMedBERT (+1{\%} at PubMedQA), with about 96{\%} less training time. Furthermore, it outperforms DistilBERT and Adapt-and-Distill, showing a significant improvement in document-level tasks. Investigation of vocabulary size and tolerance, which are hyperparameters of our method, provides insights for further model optimization. The fact that VE-KD consistently maintains its advantages, even when the corpus size is small, suggests that it is a practical approach for domain-specific language tasks and is transferrable to different domains for broader applications."
}