Culturally Grounded Image Captioning in Indigenous Languages with Vision-Language Models: Cascaded and Single-Stage Approaches

Mirelle Bueno; Sushil Garg

Abstract

Culturally grounded image captioning for under-resourced Indigenous languages is challenging due to severe data scarcity and the need to describe culturally specific visual content. This paper describes our submission to the AmericasNLP 2026 shared task, where we evaluate two architectural paradigms for caption generation across Bribri, Guaraní, Yucatec Maya, Wixárika, and Orizaba Nahuatl. First, we implement a cascaded system that combines a large vision-language model with a machine translation pipeline, showing that culturally contextualized, persona-based prompting improves over the official baseline in most comparable settings. Second, we develop a direct, end-to-end single-stage approach by adapting PaliGemma 2 using LoRA fine-tuning, continued pre-training, and multilingual joint training. Our single-stage experiments show that, despite severe domain mismatch and reliance on synthetic training data, multilingual training and continued pre-training improve automatic chrF++ relative to single-language LoRA fine-tuning in some settings. Overall, cascaded pipelines remain the strongest among the evaluated approaches under current data constraints, while single-stage models remain a promising but currently data-limited path toward direct Indigenous-language image captioning.

Anthology ID: 2026.americasnlp-6.23
Volume: Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Manuel Mager, Abteen Ebrahimi, Minh Duc Bui, Robert Pugh, Arturo Oncevay, Luis Chiruzzo, Rolando Coto Solano, Shruti Rijhwani, Katharina Von Der Wense
Venues: AmericasNLP | WS
Publisher: Association for Computational Linguistics
Pages: 248–256
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.23/
Cite (ACL): Mirelle Bueno and Sushil Garg. 2026. Culturally Grounded Image Captioning in Indigenous Languages with Vision-Language Models: Cascaded and Single-Stage Approaches. In Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), pages 248–256, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Culturally Grounded Image Captioning in Indigenous Languages with Vision-Language Models: Cascaded and Single-Stage Approaches (Bueno & Garg, AmericasNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.americasnlp-6.23.pdf