Christopher Driggers-Ellis

2026

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing
Aashish Dhawan | Christopher Driggers-Ellis | Christan Grant | Daisy Zhe Wang
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)

Machine translation for Indigenous and other low-resource languages is constrained by limited parallel data, orthographic variation, and evaluation instability for morphologically rich languages. In this work, we study Spanish–Aymara, Spanish–Guarani, and Spanish–Quechua translation in the context of the AmericasNLP benchmarks, focusing on data-centric improvements rather than architectural changes. We augment curated parallel corpora with forward-translated synthetic sentence pairs generated using a high-capacity multilingual translation model, while applying conservative, language-specific preprocessing tailored to each language. Training data is filtered using length-ratio constraints and deduplication, whereas official development sets are left unfiltered to ensure fair evaluation. We fine-tune a multilingual mBART model under curated-only and curated+synthetic settings and evaluate performance primarily using chrF++, which is better suited for agglutinative languages than BLEU. Across all three languages, synthetic data augmentation consistently improves chrF++, with the largest gains observed for Aymara and Guarani, while Quechua benefits primarily from deterministic orthographic normalization. Our analysis highlights both the effectiveness and the limitations of generic preprocessing for highly agglutinative languages, suggesting that data-centric augmentation and language-aware normalization are strong, reproducible baselines for low-resource Indigenous language machine translation.
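
The training-data filtering mentioned in this abstract (length-ratio constraints plus deduplication) can be illustrated with a minimal sketch. The function name `filter_pairs`, the whitespace tokenization, and the `max_ratio=3.0` threshold are assumptions made for illustration; the abstract does not state the paper's actual thresholds or tokenization.

```python
def filter_pairs(pairs, max_ratio=3.0):
    """Drop empty, badly length-mismatched, and duplicate sentence pairs.

    `max_ratio` is a hypothetical threshold, not a value from the paper.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Skip pairs where either side is empty.
        if n_src == 0 or n_tgt == 0:
            continue
        # Length-ratio constraint on whitespace token counts.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Exact-match deduplication on the (source, target) pair.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Placeholder strings stand in for real sentences.
sample = [
    ("src one two", "tgt one"),
    ("src one two", "tgt one"),            # exact duplicate
    ("a b c d e f g h", "x"),              # length ratio 8 > 3
    ("", "y"),                             # empty source
]
filtered = filter_pairs(sample)
```

As the abstract notes, a filter of this kind would apply to training data only; the official development sets are left unfiltered.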


Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
Aashish Dhawan | Christopher Driggers-Ellis | Dzmitry Kasinets | Christan Grant | Zhe Wang
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper presents the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. The system uses a two-stage pipeline: first generating Spanish captions from images with a vision-language model, then translating them into target languages using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. The paper reports strong improvements over the shared task baseline across multiple languages, analyzes the role of retrieval, synthetic exemplars, and morphology-aware prompting, and discusses limitations related to dev-set exemplars, cascade errors, and chrF++ based evaluation.
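
The second stage, retrieval-augmented many-shot prompting, can be sketched roughly as follows. Everything here is an assumption for illustration: the abstract does not say how exemplars are retrieved or how the prompt is worded, so this sketch swaps in character-trigram Jaccard similarity and a simple prompt template, and the helper names (`char_ngrams`, `retrieve`, `build_prompt`) are hypothetical. No model is called; the sketch only builds the prompt text.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, exemplars, k=2):
    """Return the k exemplar pairs whose source side is most similar to the query."""
    q = char_ngrams(query)
    ranked = sorted(exemplars,
                    key=lambda ex: jaccard(q, char_ngrams(ex[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, exemplars, target_language, k=2):
    """Assemble a few-shot translation prompt from retrieved exemplars."""
    parts = [f"Translate from Spanish to {target_language}."]
    for src, tgt in retrieve(query, exemplars, k):
        parts.append(f"Spanish: {src}\n{target_language}: {tgt}")
    parts.append(f"Spanish: {query}\n{target_language}:")
    return "\n\n".join(parts)


# Target sides are placeholders, not real translations.
exemplars = [
    ("una mujer teje una manta", "<target 1>"),
    ("un hombre toca la flauta", "<target 2>"),
    ("el mercado está lleno", "<target 3>"),
]
prompt = build_prompt("una mujer teje un sombrero", exemplars, "Guarani", k=2)
```

The resulting string would then be sent to the translation model; a many-shot setting would use a much larger `k` and exemplar pool than this toy example.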


Formal Machine Interpretation for the Semasiographic Mixtec Codices of Precolonial and Early Colonial Mesoamerica
Christopher Driggers-Ellis | Gabriel Ayoubi | Girish Salunke | Christan Grant
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)

The precolonial and early colonial Mixtec codices describe the history and stories of the region in a semasiographic medium that is full of symbolic representations and meant to be narrated. Recently, the community has introduced datasets of XML representations of related media, including Aztec codices and Mayan hieroglyphic script, in a step towards symbolic machine interpretation of these historic Mesoamerican artifacts. In this work, we propose formal symbolic machine interpretation of XML encodings representing facsimile images from the Mixtec Codex Zouche-Nuttall. We demonstrate the efficacy of symbolic machine interpretation from XML step-by-step, showing how our parser and interpreter process text capturing a scene from the Mixtec Codex Zouche-Nuttall. We hope our contribution and the example we provide motivate collaboration among the archaeological, historical, linguistic, and natural language processing research communities to apply machine interpretation to Mixtec codices and similar manuscripts.

Co-authors

Daisy Zhe Wang 1

Zhe Wang 1
