Investigating variation in written forms of Nahuatl using character-based language models

Robert Pugh, Francis Tyers


Abstract
We describe experiments with character-based language modeling for written variants of Nahuatl. Using a standard LSTM model and publicly available Bible translations, we explore how character language models can be applied to the tasks of estimating mutual intelligibility, identifying genetic similarity, and distinguishing written variants. We demonstrate that these simple language models are able to capture similarities and differences that have been described in the linguistic literature.
Anthology ID:
2021.americasnlp-1.3
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Venue:
AmericasNLP
Publisher:
Association for Computational Linguistics
Pages:
21–27
URL:
https://aclanthology.org/2021.americasnlp-1.3
DOI:
10.18653/v1/2021.americasnlp-1.3
Cite (ACL):
Robert Pugh and Francis Tyers. 2021. Investigating variation in written forms of Nahuatl using character-based language models. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 21–27, Online. Association for Computational Linguistics.
Cite (Informal):
Investigating variation in written forms of Nahuatl using character-based language models (Pugh & Tyers, AmericasNLP 2021)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2021.americasnlp-1.3.pdf
Code:
lguyogiro/nahuatl-variant-charlms-americasnlp