AfroLID: A Neural Language Identification Tool for African Languages
Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Inciarte
Abstract
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world’s 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F_1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID’s powerful capabilities and limitations.
- Anthology ID:
- 2022.emnlp-main.128
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1958–1981
- URL:
- https://aclanthology.org/2022.emnlp-main.128
- Cite (ACL):
- Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, and Alcides Inciarte. 2022. AfroLID: A Neural Language Identification Tool for African Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1958–1981, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- AfroLID: A Neural Language Identification Tool for African Languages (Adebara et al., EMNLP 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.emnlp-main.128.pdf