Abstract
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models’ pretraining data and target language varieties.
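As a rough illustration of the two adaptation steps the abstract describes (vocabulary augmentation followed by continued language-specific pretraining), here is a minimal sketch using the HuggingFace Transformers and Datasets libraries. This is not the authors' implementation (see the linked ethch18/parsing-mbert repository for that): the corpus file, the added wordpieces, and the training hyperparameters are all placeholders, and `add_tokens` is just one simple way to extend the vocabulary, which may differ from the paper's exact augmentation procedure.

```python
# Sketch of adapting mBERT to a low-resource language variety:
# (1) augment the vocabulary, (2) continue masked-LM pretraining
# on a small unlabeled corpus. Placeholder names throughout.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# (1) Vocabulary augmentation: add frequent target-language wordpieces
# that mBERT's vocabulary would otherwise over-segment, then resize the
# embedding matrix to match. The tokens below are placeholders.
new_wordpieces = ["examplepiece", "##examplesuffix"]
tokenizer.add_tokens(new_wordpieces)
model.resize_token_embeddings(len(tokenizer))

# (2) Language-specific pretraining: continue masked language modeling
# on a small unlabeled corpus in the target variety (placeholder path).
raw = load_dataset("text", data_files={"train": "target_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="mbert-adapted",
    num_train_epochs=20,           # placeholder hyperparameters
    per_device_train_batch_size=32,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=train).train()

model.save_pretrained("mbert-adapted")
tokenizer.save_pretrained("mbert-adapted")
```

In the paper's setup, an adapted checkpoint like this then serves as the encoder for a dependency parser trained on the small target-language treebank.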
- Anthology ID:
- 2020.findings-emnlp.118
- Original:
- 2020.findings-emnlp.118v1
- Version 2:
- 2020.findings-emnlp.118v2
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1324–1334
- URL:
- https://aclanthology.org/2020.findings-emnlp.118
- DOI:
- 10.18653/v1/2020.findings-emnlp.118
- Cite (ACL):
- Ethan C. Chau, Lucy H. Lin, and Noah A. Smith. 2020. Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1324–1334, Online. Association for Computational Linguistics.
- Cite (Informal):
- Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank (Chau et al., Findings 2020)
- PDF:
- https://aclanthology.org/2020.findings-emnlp.118.pdf
- Code:
- ethch18/parsing-mbert