From Machine Translation to Image Captioning: Training Vision-Language Models for Indigenous Languages of the Americas

Luis Lara; Param Raval

Abstract

We describe our system for the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages of the Americas. Our post-training pipeline starts from Aya Vision 32B: the vision-language model is first fine-tuned on machine translation data from prior AmericasNLP shared tasks and then further fine-tuned on the cultural image captioning data. Translation thus serves as an intermediate training task, while the final system produces captions directly in the requested Indigenous language rather than translating a Spanish caption afterward. Our experiments show that machine translation fine-tuning is an important initialization step. The resulting fine-tuned vision-language model also shows translation capabilities for the languages considered in this work. In addition, our zero-shot GPT-5.5 submission ranks first in the Maya language track under the official human-evaluation stage.

Anthology ID: 2026.americasnlp-6.20
Volume: Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues: AmericasNLP | WS
Publisher: Association for Computational Linguistics
Pages: 224–235
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.20/
Cite (ACL): Luis Lara and Param Raval. 2026. From Machine Translation to Image Captioning: Training Vision-Language Models for Indigenous Languages of the Americas. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 224–235, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): From Machine Translation to Image Captioning: Training Vision-Language Models for Indigenous Languages of the Americas (Lara & Raval, AmericasNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.20.pdf
Supplementary material: 2026.americasnlp-6.20.SupplementaryMaterial.zip