AfroLID: A Neural Language Identification Tool for African Languages

Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Inciarte


Abstract
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world’s 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F1 score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to showcase both AfroLID’s powerful capabilities and its limitations.
Anthology ID:
2022.emnlp-main.128
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1958–1981
URL:
https://aclanthology.org/2022.emnlp-main.128
Cite (ACL):
Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, and Alcides Inciarte. 2022. AfroLID: A Neural Language Identification Tool for African Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1958–1981, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
AfroLID: A Neural Language Identification Tool for African Languages (Adebara et al., EMNLP 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-ingestion/2022.emnlp-main.128.pdf