From N-grams to Pre-trained Multilingual Models For Language Identification

Thapelo Andrew Sindane, Vukosi Marivate


Abstract
In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that careful data size selection remains crucial for building frequency distributions that model each target language well, thereby improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively multilingual pre-trained language models (PLMs) – mBERT, RemBERT, and XLM-r – and Afri-centric multilingual models – AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale LID tools – Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID – to highlight the importance of language-focused LID. We show that Serengeti is, on average, the best-performing model across all categories, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid), trained on the NCHLT + Vukzenzele corpus, which performs on par with our best-performing Afri-centric models.
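As background for readers unfamiliar with N-gram LID, below is a minimal sketch of a classic character-n-gram rank-order identifier (in the style of Cavnar & Trenkle). The toy corpora, parameter values, and function names are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Yield all character n-grams of length n_min..n_max, with word padding."""
    text = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_profile(corpus, top_k=300):
    """Map the top_k most frequent n-grams of a corpus to their frequency rank."""
    counts = Counter(char_ngrams(corpus))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top_k))}

def identify(text, profiles, top_k=300):
    """Pick the language whose profile is closest in rank-order distance."""
    doc = build_profile(text, top_k)
    def distance(profile):
        # Out-of-profile n-grams get the maximum penalty top_k.
        return sum(abs(profile.get(g, top_k) - r) for g, r in doc.items())
    return min(profiles, key=lambda lang: distance(profiles[lang]))

# Toy single-sentence "corpora"; real profiles need far more text per language.
profiles = {
    "eng": build_profile("the quick brown fox jumps over the lazy dog"),
    "nso": build_profile("ke a le leboga bagwera ba ka ka moka"),
}
print(identify("the dog jumps", profiles))  # → eng
```

The rank-distance step is what the abstract's "language ranking" refers to in spirit: each candidate language is scored and ordered, and effective profiles depend directly on how much (and which) data built them.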
Anthology ID:
2024.nlp4dh-1.22
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
229–239
URL:
https://aclanthology.org/2024.nlp4dh-1.22
DOI:
10.18653/v1/2024.nlp4dh-1.22
Cite (ACL):
Thapelo Andrew Sindane and Vukosi Marivate. 2024. From N-grams to Pre-trained Multilingual Models For Language Identification. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 229–239, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
From N-grams to Pre-trained Multilingual Models For Language Identification (Sindane & Marivate, NLP4DH 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.nlp4dh-1.22.pdf