Abstract
Although over 100 languages are supported by strong off-the-shelf machine translation systems, only a subset of them possess large annotated corpora for named entity recognition. Motivated by this fact, we leverage machine translation to improve annotation-projection approaches to cross-lingual named entity recognition. We propose a system that improves over prior entity-projection methods by: (a) leveraging machine translation systems twice: first for translating sentences and subsequently for translating entities; (b) matching entities based on orthographic and phonetic similarity; and (c) identifying matches based on distributional statistics derived from the dataset. Our approach improves upon current state-of-the-art methods for cross-lingual named entity recognition on 5 diverse languages by an average of 4.1 points. Further, our method achieves state-of-the-art F_1 scores for Armenian, outperforming even a monolingual model trained on Armenian source data.
- Anthology ID:
- D19-1100
- Original:
- D19-1100v1
- Version 2:
- D19-1100v2
- Volume:
- Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong, China
- Venues:
- EMNLP | IJCNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1083–1092
- URL:
- https://aclanthology.org/D19-1100
- DOI:
- 10.18653/v1/D19-1100
- Cite (ACL):
- Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity Projection via Machine Translation for Cross-Lingual NER. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1083–1092, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal):
- Entity Projection via Machine Translation for Cross-Lingual NER (Jain et al., EMNLP-IJCNLP 2019)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/D19-1100.pdf
- Code:
- alankarj/cross_lingual_ner
- Data:
- CoNLL 2002, CoNLL-2003, Polyglot-NER
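
Step (b) of the abstract, matching a translated entity against the translated sentence by orthographic similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `best_orthographic_match`, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and phonetic matching is omitted.

```python
from difflib import SequenceMatcher


def best_orthographic_match(entity, tokens, threshold=0.8):
    """Return (start, end, score) for the token span most similar to `entity`.

    Hypothetical helper: `entity` is a machine-translated entity string and
    `tokens` is the tokenized translated sentence. Returns None when no span
    reaches `threshold` (an assumed cutoff, not a value from the paper).
    """
    n_words = len(entity.split())
    best = None
    # Consider spans whose length is within one token of the entity's length.
    for length in range(max(1, n_words - 1), n_words + 2):
        for start in range(0, len(tokens) - length + 1):
            candidate = " ".join(tokens[start:start + length])
            # Character-level similarity in [0, 1]; a stand-in for whatever
            # orthographic score the actual system uses.
            score = SequenceMatcher(None, entity.lower(), candidate.lower()).ratio()
            if best is None or score > best[2]:
                best = (start, start + length, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


tokens = "Barack Obama visitó Madrid ayer".split()
match = best_orthographic_match("Barak Obama", tokens)
```

A span that clears the threshold would receive the source entity's tag; a `None` result is where a fuller system could fall back to the phonetic or distributional matching described in the abstract.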