Multilingual ELMo and the Effects of Corpus Sampling
Vinit Ravishankar, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal
Abstract
Multilingual pretrained language models are rapidly gaining popularity in NLP systems for non-English languages. Most of these models feature an important corpus sampling step in the process of accumulating training data in different languages, to ensure that the signal from better-resourced languages does not drown out that from poorly resourced ones. In this study, we train multiple multilingual recurrent language models, based on the ELMo architecture, and analyse both the effect of varying corpus size ratios on downstream performance and the performance difference between monolingual models for each language and broader multilingual language models. As part of this effort, we also make these trained models available for public use.
- Anthology ID:
- 2021.nodalida-main.41
- Volume:
- Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- 31 May–2 June
- Year:
- 2021
- Address:
- Reykjavik, Iceland (Online)
- Editors:
- Simon Dobnik, Lilja Øvrelid
- Venue:
- NoDaLiDa
- Publisher:
- Linköping University Electronic Press, Sweden
- Pages:
- 378–384
- URL:
- https://aclanthology.org/2021.nodalida-main.41
- Cite (ACL):
- Vinit Ravishankar, Andrey Kutuzov, Lilja Øvrelid, and Erik Velldal. 2021. Multilingual ELMo and the Effects of Corpus Sampling. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 378–384, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
- Cite (Informal):
- Multilingual ELMo and the Effects of Corpus Sampling (Ravishankar et al., NoDaLiDa 2021)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/2021.nodalida-main.41.pdf
- Data:
- XNLI