Multimodal Machine Translation with Text-Image In-depth Questioning

Yue Gao, Jing Zhao, Shiliang Sun, Xiaosong Qiao, Tengfei Song, Hao Yang


Abstract
Multimodal machine translation (MMT) integrates visual information to address ambiguity and contextual limitations in neural machine translation (NMT). Some empirical studies have revealed that many MMT models underutilize visual data during translation. These studies attempt to enhance cross-modal interactions to better exploit visual data. However, they focus only on simple interactions between nouns in the text and the corresponding entities in the image, overlooking global semantic alignment, particularly for prepositional phrases and verbs, which are more likely to be translated incorrectly. To address this, we design a Text-Image In-depth Questioning method to deepen interactions and optimize translations. Furthermore, to mitigate errors arising from contextually irrelevant image noise, we propose a Consistency Constraint strategy to improve the robustness of our approach. Our approach achieves state-of-the-art results on five translation directions of Multi30K and AmbigCaps, with a gain of +2.35 BLEU on the challenging MSCOCO benchmark, validating our method’s effectiveness in utilizing visual data and capturing comprehensive textual semantics.
Anthology ID:
2025.findings-acl.483
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9274–9287
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.483/
Cite (ACL):
Yue Gao, Jing Zhao, Shiliang Sun, Xiaosong Qiao, Tengfei Song, and Hao Yang. 2025. Multimodal Machine Translation with Text-Image In-depth Questioning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9274–9287, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Multimodal Machine Translation with Text-Image In-depth Questioning (Gao et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.483.pdf