Abstract
Recent work in multilingual natural language processing has shown progress in various tasks such as natural language inference and joint multilingual translation. Despite success in learning across many languages, challenges arise where multilingual training regimes often boost performance on some languages at the expense of others. For multilingual named entity recognition (NER) we propose a simple technique that groups similar languages together by using embeddings from a pre-trained masked language model, and automatically discovering language clusters in this embedding space. Specifically, we fine-tune an XLM-Roberta model on a language identification task, and use embeddings from this model for clustering. We conduct experiments on 15 diverse languages in the WikiAnn dataset and show our technique largely outperforms three baselines: (1) training a multilingual model jointly on all available languages, (2) training one monolingual model per language, and (3) grouping languages by linguistic family. We also conduct analyses showing meaningful multilingual transfer for low-resource languages (Swahili and Yoruba), despite being automatically grouped with other seemingly disparate languages.
- Anthology ID:
- 2021.findings-emnlp.4
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 40–45
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2021.findings-emnlp.4/
- DOI:
- 10.18653/v1/2021.findings-emnlp.4
- Cite (ACL):
- Kyle Shaffer. 2021. Language Clustering for Multilingual Named Entity Recognition. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 40–45, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Language Clustering for Multilingual Named Entity Recognition (Shaffer, Findings 2021)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2021.findings-emnlp.4.pdf
- Data
- CoNLL 2003
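The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes sentence embeddings have already been extracted from a model fine-tuned for language identification; that extraction is not shown. The abstract does not name the clustering algorithm, so average-linkage agglomerative clustering over cosine distance is swapped in here as an assumption. The function names (`language_centroids`, `cluster_languages`) and the three-dimensional toy vectors are hypothetical, invented for illustration, and do not reflect the paper's data or results.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def language_centroids(sentence_embeddings):
    """Mean-pool (language_code, vector) pairs into one vector per language."""
    by_language = defaultdict(list)
    for language, vector in sentence_embeddings:
        by_language[language].append(vector)
    return {lang: mean_vector(vecs) for lang, vecs in by_language.items()}


def cluster_languages(centroids, num_clusters):
    """Average-linkage agglomerative clustering (an assumed choice of algorithm)."""
    clusters = [[lang] for lang in sorted(centroids)]

    def linkage(first, second):
        # Mean pairwise distance between members of the two clusters.
        distances = [
            cosine_distance(centroids[a], centroids[b])
            for a in first
            for b in second
        ]
        return sum(distances) / len(distances)

    while len(clusters) > num_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


# Hypothetical toy embeddings; real ones would come from the fine-tuned model.
toy_embeddings = [
    ("en", [0.9, 0.1, 0.0]),
    ("en", [0.8, 0.2, 0.1]),
    ("de", [0.85, 0.15, 0.05]),
    ("de", [0.9, 0.2, 0.0]),
    ("sw", [0.1, 0.9, 0.2]),
    ("yo", [0.0, 0.8, 0.3]),
    ("zh", [0.1, 0.1, 0.9]),
]

groups = cluster_languages(language_centroids(toy_embeddings), num_clusters=3)
print(groups)
```

Under this sketch, each resulting group would then get its own NER model trained on the pooled data of its member languages. The number of clusters is passed in by hand here; how it is chosen in the paper is not stated in the abstract.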