Bringing Mapudungun into the Modern MT Ecosystem: Morphology-Aware Tokenization for NLLB-200 Fine-Tuning

Isaac Thompson, Brandon Rogers, Eric Ringger


Abstract
For Mapudungun arn→es translation, morphology-aware tokenization can substitute for a 5× increase in model parameters. We fine-tune three sizes of Meta’s NLLB-200 on Mapudungun–Spanish translation across eight tokenization strategies, including our novel Morfessor-VC method, which constrains Morfessor morpheme segmentation to tokens already present in NLLB’s pretrained vocabulary. Our 600M Morfessor-VC model is competitive with our own fine-tuned 3.3B Standard BPE model on arn→es (43.2 vs. 42.9 chrF++, ∆ = +0.3, p = 0.039, 95% CI [0.02, 0.60]) while using five times fewer parameters, and all fine-tuned conditions surpass frontier LLMs by over 27 chrF++. Mapudungun is an indigenous polysynthetic language spoken by 200,000+ Mapuche people in Chile and Argentina, absent from NLLB-200 and not supported by major commercial MT providers; prior work predates large-scale multilingual models and does not address the tokenization challenges posed by its agglutinative morphology. These results establish new state-of-the-art baselines for Mapudungun MT and provide a practical foundation for community language tools in pedagogy, social media, and language revitalization.
Anthology ID:
2026.americasnlp-6.16
Volume:
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues:
AmericasNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
173–185
Language:
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.16/
DOI:
Bibkey:
Cite (ACL):
Isaac Thompson, Brandon Rogers, and Eric Ringger. 2026. Bringing Mapudungun into the Modern MT Ecosystem: Morphology-Aware Tokenization for NLLB-200 Fine-Tuning. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 173–185, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Bringing Mapudungun into the Modern MT Ecosystem: Morphology-Aware Tokenization for NLLB-200 Fine-Tuning (Thompson et al., AmericasNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.16.pdf
Supplementary material:
 2026.americasnlp-6.16.SupplementaryMaterial.zip