Abstract
Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world’s 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children’s stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children’s stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
- Anthology ID:
- 2023.emnlp-main.895
- Volume:
- Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 14496–14519
- URL:
- https://aclanthology.org/2023.emnlp-main.895
- DOI:
- 10.18653/v1/2023.emnlp-main.895
- Cite (ACL):
- Milind Agarwal, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos. 2023. LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14496–14519, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages (Agarwal et al., EMNLP 2023)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2023.emnlp-main.895.pdf