Does Vision Still Help? Multimodal Translation with CLIP-Based Image Selection

Deepak Kumar, Baban Gain, Kshetrimayum Boynao Singh, Asif Ekbal


Abstract
Multimodal Machine Translation aims to enhance conventional text-only translation systems by incorporating visual context, typically in the form of images paired with captions. In this work, we present our submission to the WAT 2025 Multimodal Translation Shared Task, which explores the role of visual information in translating English captions into four Indic languages: Hindi, Bengali, Malayalam, and Odia. Our system builds upon the strong multilingual text translation backbone IndicTrans, augmented with a CLIP-based selective visual grounding mechanism. Specifically, we compute cosine similarities between text and image embeddings (for both full images and cropped regions) and automatically select the most semantically aligned image representation to integrate into the translation model. We observe that the overall contribution of visual features is questionable. Our findings reaffirm recent evidence that large multilingual translation models can perform competitively without explicit visual grounding.
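The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `select_best_image` are hypothetical, and the short toy vectors stand in for embeddings that would, in practice, come from a CLIP text encoder and image encoder. The sketch only shows the general idea of scoring each candidate image representation against the caption and keeping the highest-scoring one.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no alignment
    return dot / (norm_a * norm_b)


def select_best_image(
    text_emb: List[float], candidates: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the candidate whose embedding is most similar to the text.

    `candidates` maps a label (hypothetical, e.g. "full_image" or
    "cropped_region") to its image embedding.
    """
    if not candidates:
        raise ValueError("at least one candidate embedding is required")
    scores = {
        name: cosine_similarity(text_emb, emb)
        for name, emb in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy vectors standing in for CLIP embeddings (assumed values, not real data).
text_emb = [0.9, 0.1, 0.0]
candidates = {
    "full_image": [0.5, 0.5, 0.5],
    "cropped_region": [0.8, 0.2, 0.1],
}
best_name, best_score = select_best_image(text_emb, candidates)
print(best_name, round(best_score, 3))
```

In this toy case the cropped region scores higher than the full image, so it would be the representation passed on to the translation model. How the selected representation is integrated into the translation backbone is not specified in the abstract and is left out of the sketch.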
Anthology ID:
2025.wat-1.12
Volume:
Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Toshiaki Nakazawa, Isao Goto
Venues:
WAT | WS
Publisher:
Association for Computational Linguistics
Pages:
115–123
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wat-1.12/
Cite (ACL):
Deepak Kumar, Baban Gain, Kshetrimayum Boynao Singh, and Asif Ekbal. 2025. Does Vision Still Help? Multimodal Translation with CLIP-Based Image Selection. In Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025), pages 115–123, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
Does Vision Still Help? Multimodal Translation with CLIP-Based Image Selection (Kumar et al., WAT 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wat-1.12.pdf