GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Fitsum Gaim; Wonsuk Yang; Jong C. Park


Abstract

Language identification is one of the fundamental tasks in natural language processing and a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user’s experience. In this work, we present a language identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge’ez script as a writing system, namely Amharic, Blin, Ge’ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch.

Anthology ID: 2022.lrec-1.707
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 6578–6584
URL: https://preview.aclanthology.org/ingest-emnlp/2022.lrec-1.707/
Cite (ACL): Fitsum Gaim, Wonsuk Yang, and Jong C. Park. 2022. GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6578–6584, Marseille, France. European Language Resources Association.
Cite (Informal): GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages (Gaim et al., LREC 2022)
PDF: https://preview.aclanthology.org/ingest-emnlp/2022.lrec-1.707.pdf
