Bridging the Gap: Leveraging Cherokee to Improve Language Identification for Endangered Iroquoian Languages

Liam Enzo Eggleston, Michael P. Cacioli, Jatin Sarabu, Ivory Yang, Kevin Zhu


Abstract
Language identification is a foundational task in natural language processing (NLP), yet many Indigenous languages remain entirely unsupported by commercial language identification systems. In this study, we assess the performance of Google LangID on a 5k Cherokee dataset and find that every sentence is classified as “undetermined”: the system does not even misidentify Cherokee as another language. To explore this issue further, we manually constructed the first digitized Northern Iroquoian dataset, consisting of 120 sentences across five related languages: Onondaga, Cayuga, Mohawk, Seneca, and Oneida. Running these sentences through Google LangID, we examine patterns in its incorrect predictions. To address these limitations, we train a random forest classifier that successfully distinguishes between these languages, demonstrating its effectiveness for language identification. Our findings underscore the inadequacies of existing commercial language identification models for Indigenous languages and highlight concrete steps toward improving automated recognition of low-resource languages.
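The abstract does not state which features or hyperparameters the random forest uses, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes character-bigram presence features, a small from-scratch forest (bootstrap sampling plus random feature subsets with Gini splits), and synthetic placeholder strings with hypothetical labels `lang_a` and `lang_b` in place of real Iroquoian sentences. All function and variable names are illustrative.

```python
import random
from collections import Counter

# Synthetic placeholder strings, NOT real Iroquoian text (assumption for illustration).
TOY_SENTENCES = [
    "kaka kata taka", "taka kaka", "kata taka kaka",
    "mumu lumo molu", "lumo mumu", "molu lumo mumu",
]
TOY_LABELS = ["lang_a"] * 3 + ["lang_b"] * 3


def char_ngrams(text, n=2):
    """Return the set of character n-grams in a lowercased sentence."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def build_tree(rows, rng, n_candidates, depth=0, max_depth=5):
    """Grow one tree on rows of (ngram_set, label).

    A leaf is a label string; an internal node is a tuple
    (ngram, subtree_if_present, subtree_if_absent).
    """
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    vocab = sorted(set().union(*(feats for feats, _ in rows)))
    best = None
    # Only a random subset of n-grams is considered at each node.
    for gram in rng.sample(vocab, min(n_candidates, len(vocab))):
        present = [row for row in rows if gram in row[0]]
        absent = [row for row in rows if gram not in row[0]]
        if not present or not absent:
            continue
        score = sum(len(part) * gini([label for _, label in part])
                    for part in (present, absent)) / len(rows)
        if best is None or score < best[0]:
            best = (score, gram, present, absent)
    if best is None:
        return majority
    _, gram, present, absent = best
    return (gram,
            build_tree(present, rng, n_candidates, depth + 1, max_depth),
            build_tree(absent, rng, n_candidates, depth + 1, max_depth))


def predict_tree(tree, feats):
    """Walk from the root to a leaf label."""
    while isinstance(tree, tuple):
        gram, if_present, if_absent = tree
        tree = if_present if gram in feats else if_absent
    return tree


def train_forest(sentences, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    rows = [(char_ngrams(s), y) for s, y in zip(sentences, labels)]
    vocab_size = len(set().union(*(feats for feats, _ in rows)))
    n_candidates = max(1, int(vocab_size ** 0.5))
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        forest.append(build_tree(sample, rng, n_candidates))
    return forest


def predict(forest, sentence):
    """Majority vote over all trees in the forest."""
    feats = char_ngrams(sentence)
    votes = Counter(predict_tree(tree, feats) for tree in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    forest = train_forest(TOY_SENTENCES, TOY_LABELS)
    print(predict(forest, "kaka taka kata"))
    print(predict(forest, "lumo mumu molu lumo"))
```

Character n-grams are a common choice for language identification because they need no tokenizer or dictionary, which matters for languages with little existing tooling. With real data, a held-out split would be needed to estimate accuracy; the toy strings here only show the mechanics.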
Anthology ID:
2025.lowresnlp-1.1
Volume:
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Ernesto Luis Estevanell-Valladares, Alicia Picazo-Izquierdo, Tharindu Ranasinghe, Besik Mikaberidze, Simon Ostermann, Daniil Gurgurov, Philipp Mueller, Claudia Borg, Marián Šimko
Venues:
LowResNLP | WS
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
1–6
URL:
https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.1/
Cite (ACL):
Liam Enzo Eggleston, Michael P. Cacioli, Jatin Sarabu, Ivory Yang, and Kevin Zhu. 2025. Bridging the Gap: Leveraging Cherokee to Improve Language Identification for Endangered Iroquoian Languages. In Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages, pages 1–6, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
Bridging the Gap: Leveraging Cherokee to Improve Language Identification for Endangered Iroquoian Languages (Eggleston et al., LowResNLP 2025)
PDF:
https://preview.aclanthology.org/corrections-2026-01/2025.lowresnlp-1.1.pdf