Param Raval

2026

From Machine Translation to Image Captioning: Training Vision-Language Models for Indigenous Languages of the Americas
Luis Lara | Param Raval
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

We describe our system for the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages of the Americas. Our post-training pipeline starts from Aya Vision 32B: the vision-language model is first fine-tuned on machine translation data from prior AmericasNLP shared tasks and then further fine-tuned on the cultural image captioning data. This approach uses translation as an intermediate training task, while the final system produces captions directly in the requested Indigenous language rather than translating a Spanish caption afterward. Our experiments show that machine translation fine-tuning is an important initialization step. The resulting fine-tuned vision-language model also shows translation capabilities for the languages considered in this work. In addition, our zero-shot GPT-5.5 submission ranks first in the Maya language track under the official human-evaluation stage.

Co-authors

Luis Lara 1

Venues

AmericasNLP 1
WS 1
