mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
Rasmus Kær Jørgensen, Mareike Hartmann, Xiang Dai, Desmond Elliott
Abstract
Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets for biomedical named entity recognition and financial sentence classification, covering seven different languages, shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
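The abstract defines domain adaptive pretraining as continued unsupervised pretraining of a language model on domain-specific text. As an illustration only, the sketch below shows that idea with Hugging Face transformers: a general multilingual checkpoint (bert-base-multilingual-cased) is further trained with masked language modelling on a mixed-language, domain-specific corpus. The corpus file name and hyperparameters are placeholders rather than the authors' setup, and the paper's second method, adapter-based pretraining, is not shown here.

```python
# Minimal sketch of multilingual domain-adaptive pretraining via continued
# masked-language-model training. Assumes the Hugging Face transformers and
# datasets libraries; the corpus path and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "bert-base-multilingual-cased"  # general multilingual starting point

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

# Hypothetical mixed-language, domain-specific corpus (e.g. biomedical or
# financial text), one document or sentence per line.
corpus = load_dataset("text", data_files={"train": "domain_corpus_multilingual.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mdapt-checkpoint",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The resulting checkpoint can then be fine-tuned on downstream tasks such as the biomedical NER and financial sentence classification datasets mentioned in the abstract.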
- Anthology ID: 2021.findings-emnlp.290
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
- Month: November
- Year: 2021
- Address: Punta Cana, Dominican Republic
- Venue: Findings
- SIG: SIGDAT
- Publisher: Association for Computational Linguistics
- Pages: 3404–3418
- URL: https://aclanthology.org/2021.findings-emnlp.290
- DOI: 10.18653/v1/2021.findings-emnlp.290
- Cite (ACL): Rasmus Kær Jørgensen, Mareike Hartmann, Xiang Dai, and Desmond Elliott. 2021. mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3404–3418, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model (Kær Jørgensen et al., Findings 2021)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2021.findings-emnlp.290.pdf
- Code: rasmuskaer/mdapt_supplements
- Data: NCBI Disease