Language Clustering for Multilingual Named Entity Recognition

Kyle Shaffer


Abstract
Recent work in multilingual natural language processing has shown progress in various tasks such as natural language inference and joint multilingual translation. Despite this success in learning across many languages, multilingual training regimes often boost performance on some languages at the expense of others. For multilingual named entity recognition (NER), we propose a simple technique that groups similar languages together by taking embeddings from a pre-trained masked language model and automatically discovering language clusters in this embedding space. Specifically, we fine-tune an XLM-RoBERTa model on a language identification task, and use embeddings from this model for clustering. We conduct experiments on 15 diverse languages in the WikiAnn dataset and show our technique largely outperforms three baselines: (1) training a multilingual model jointly on all available languages, (2) training one monolingual model per language, and (3) grouping languages by linguistic family. We also conduct analyses showing meaningful multilingual transfer for low-resource languages (Swahili and Yoruba), even though these languages are automatically grouped with other, seemingly disparate languages.
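The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how sentence embeddings are pooled per language or which clustering algorithm is used, so the choices here (mean-pooling each language's embeddings into a centroid, then running plain k-means) are assumptions, not the paper's confirmed method. The function names `language_centroids` and `kmeans` are hypothetical, the language labels are placeholders, and the tiny 2-D vectors stand in for embeddings that would come from the fine-tuned language identification model.

```python
import math
import random
from collections import defaultdict


def language_centroids(embeddings, languages):
    """Average the sentence embeddings that belong to each language."""
    sums, counts = {}, defaultdict(int)
    for vec, lang in zip(embeddings, languages):
        if lang not in sums:
            sums[lang] = [0.0] * len(vec)
        sums[lang] = [s + v for s, v in zip(sums[lang], vec)]
        counts[lang] += 1
    return {lang: [s / counts[lang] for s in total] for lang, total in sums.items()}


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means over a dict of name -> vector; returns name -> cluster id."""
    names = sorted(points)
    rng = random.Random(seed)
    # Assumption: initial centers are k randomly chosen language centroids.
    centers = [points[n] for n in rng.sample(names, k)]
    assignment = {}
    for _ in range(iterations):
        # Assign every language to its nearest center (Euclidean distance).
        new_assignment = {
            n: min(range(k), key=lambda c: math.dist(points[n], centers[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each center to the mean of its members; keep it if it has none.
        for c in range(k):
            members = [points[n] for n in names if assignment[n] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy stand-in data: two sentence embeddings per placeholder language.
embeddings = [
    [0.0, 0.1], [0.1, 0.0],   # lang_a
    [0.2, 0.1], [0.1, 0.2],   # lang_b
    [5.0, 5.1], [5.1, 4.9],   # lang_c
    [4.8, 5.0], [5.2, 5.2],   # lang_d
]
languages = ["lang_a", "lang_a", "lang_b", "lang_b",
             "lang_c", "lang_c", "lang_d", "lang_d"]

clusters = kmeans(language_centroids(embeddings, languages), k=2)
for lang in sorted(clusters):
    print(lang, "-> cluster", clusters[lang])
```

Under this sketch, each resulting cluster would define the set of languages whose NER training data is pooled to train one model, which sits between the fully joint and fully monolingual baselines named in the abstract.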
Anthology ID:
2021.findings-emnlp.4
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
40–45
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.findings-emnlp.4/
DOI:
10.18653/v1/2021.findings-emnlp.4
Cite (ACL):
Kyle Shaffer. 2021. Language Clustering for Multilingual Named Entity Recognition. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 40–45, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Language Clustering for Multilingual Named Entity Recognition (Shaffer, Findings 2021)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.findings-emnlp.4.pdf
Video:
https://preview.aclanthology.org/build-pipeline-with-new-library/2021.findings-emnlp.4.mp4
Data
CoNLL 2003