Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour
Abstract
We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
- Anthology ID: 2023.repl4nlp-1.2
- Volume: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Burcu Can, Maximilian Mozes, Samuel Cahyawijaya, Naomi Saphra, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Chen Zhao, Isabelle Augenstein, Anna Rogers, Kyunghyun Cho, Edward Grefenstette, Lena Voita
- Venue: RepL4NLP
- SIG:
- Publisher: Association for Computational Linguistics
- Note:
- Pages: 13–21
- Language:
- URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.repl4nlp-1.2/
- DOI: 10.18653/v1/2023.repl4nlp-1.2
- Cite (ACL): Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, and Ata Kiapour. 2023. Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 13–21, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords (Golchin et al., RepL4NLP 2023)
- PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.repl4nlp-1.2.pdf
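As a rough illustration of the approach described in the abstract, the sketch below uses the KeyBERT library to extract in-domain keywords and then masks those words in the domain text, so that the masked sequences can be used for a masked-language-model pre-training step before fine-tuning. The toy corpus, the whole-word string-matching masking, and the parameter choices (e.g., `top_n`) are assumptions for illustration only; they are not taken from the paper's implementation.

```python
# A minimal sketch (not the authors' released code) of keyword-based masking:
# extract in-domain keywords with KeyBERT, then mask those words instead of
# masking tokens at random.
from keybert import KeyBERT

# Toy target-domain corpus; in practice this would be the full in-domain text.
domain_corpus = [
    "The patient presented with acute myocardial infarction.",
    "An electrocardiogram confirmed ST-segment elevation.",
]

# 1. Identify in-domain keywords with KeyBERT.
kw_model = KeyBERT()
keywords = set()
for doc in domain_corpus:
    for word, _score in kw_model.extract_keywords(
        doc, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=5
    ):
        keywords.add(word.lower())

# 2. Mask keyword occurrences (simple whole-word, whitespace-level matching).
def mask_keywords(text: str, mask_token: str = "[MASK]") -> str:
    masked = [
        mask_token if word.lower().strip(".,;:!?()") in keywords else word
        for word in text.split()
    ]
    return " ".join(masked)

for sentence in domain_corpus:
    print(mask_keywords(sentence))
```

The masked text would then be used to continue pre-training the PLM (e.g., BERT Large) with its standard MLM objective, after which the adapted model is fine-tuned on the downstream task, as described in the abstract.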