A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation

Siddharth Betala, Kushan Raj, Vipul Betala, Rohan Saswade


Abstract
In this paper, we describe our system under the team name BLEU Monday for the English-to-Indic Multimodal Translation Task at WAT 2025. We participate in the text-only translation tasks for English-Hindi, English-Bengali, English-Malayalam, and English-Odia language pairs. We present a two-stage approach that addresses quality issues in the training data through automated error detection and correction, followed by parameter-efficient model fine-tuning. Our methodology introduces a vision-augmented judge-corrector pipeline that leverages multimodal language models to systematically identify and correct translation errors in the training data. The judge component classifies translations into three categories: correct, visually ambiguous (requiring image context), or mistranslated (poor translation quality). Identified errors are routed to specialized correctors: GPT-4o-mini regenerates captions requiring visual disambiguation, while IndicTrans2 retranslates cases with pure translation quality issues. This automated pipeline processes 28,928 training examples across four languages, correcting an average of 17.1% of captions per language. We then apply Low-Rank Adaptation (LoRA) to fine-tune the IndicTrans2 en-indic 200M distilled model on both original and corrected datasets. Training on corrected data yields consistent improvements, with BLEU score gains of +1.30 for English-Bengali on the evaluation set (42.00 → 43.30) and +0.70 on the challenge set (44.90 → 45.60), +0.60 for English-Odia on the evaluation set (41.00 → 41.60), and +0.10 for English-Hindi on the challenge set (53.90 → 54.00).
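The routing logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Verdict`, `Example`, `route`, and `correction_rate` are hypothetical, and the three callables are placeholders for the multimodal judge, the vision-based corrector (GPT-4o-mini in the paper), and the text-only retranslator (IndicTrans2 in the paper). No real model API is called here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Verdict(Enum):
    """The three judge categories named in the abstract."""
    CORRECT = "correct"
    VISUALLY_AMBIGUOUS = "visually_ambiguous"
    MISTRANSLATED = "mistranslated"


@dataclass
class Example:
    """One training example (hypothetical field names)."""
    source: str        # English caption
    translation: str   # existing target-language caption
    image_path: str    # image associated with the caption


def route(
    example: Example,
    judge: Callable[[Example], Verdict],
    visual_corrector: Callable[[Example], str],
    text_corrector: Callable[[Example], str],
) -> str:
    """Return the caption to train on, applying a corrector only when the judge flags an error."""
    verdict = judge(example)
    if verdict is Verdict.CORRECT:
        return example.translation
    if verdict is Verdict.VISUALLY_AMBIGUOUS:
        # Placeholder for the corrector that uses image context.
        return visual_corrector(example)
    # Placeholder for the text-only retranslation step.
    return text_corrector(example)


def correction_rate(originals: Sequence[str], corrected: Sequence[str]) -> float:
    """Fraction of captions that the pipeline changed."""
    if len(originals) != len(corrected):
        raise ValueError("sequences must have the same length")
    if not originals:
        return 0.0
    changed = sum(1 for o, c in zip(originals, corrected) if o != c)
    return changed / len(originals)
```

Passing the judge and correctors as callables keeps the routing independent of any particular model, so stubs can be substituted when testing the control flow.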
Anthology ID:
2025.wat-1.13
Volume:
Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Toshiaki Nakazawa, Isao Goto
Venues:
WAT | WS
Publisher:
Association for Computational Linguistics
Pages:
124–137
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wat-1.13/
Cite (ACL):
Siddharth Betala, Kushan Raj, Vipul Betala, and Rohan Saswade. 2025. A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation. In Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025), pages 124–137, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation (Betala et al., WAT 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.wat-1.13.pdf