Tengfei Song
2025
Multimodal Machine Translation with Text-Image In-depth Questioning
Yue Gao | Jing Zhao | Shiliang Sun | Xiaosong Qiao | Tengfei Song | Hao Yang
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal machine translation (MMT) integrates visual information to address ambiguity and contextual limitations in neural machine translation (NMT). Empirical studies have revealed that many MMT models underutilize visual data during translation, and existing remedies attempt to enhance cross-modal interactions so that visual data are better exploited. However, they focus only on simple interactions between nouns in the text and the corresponding entities in the image, overlooking global semantic alignment, particularly for the prepositional phrases and verbs that are more likely to be translated incorrectly. To address this, we design a Text-Image In-depth Questioning method to deepen cross-modal interactions and optimize translations. Furthermore, to mitigate errors arising from contextually irrelevant image noise, we propose a Consistency Constraint strategy that improves our approach’s robustness. Our approach achieves state-of-the-art results on five translation directions of Multi30K and AmbigCaps, with +2.35 BLEU on the challenging MSCOCO benchmark, validating our method’s effectiveness in utilizing visual data and capturing comprehensive textual semantics.
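As a rough illustration of the consistency idea mentioned in this abstract (the paper's actual formulation is not reproduced here), the sketch below penalizes divergence between the translation distributions a decoder produces with the paired image and with an unrelated image. The tensor names, shapes, and the symmetric-KL form are all assumptions for illustration only.

```python
# Minimal, hypothetical sketch of a consistency-style regularizer in PyTorch.
import torch
import torch.nn.functional as F

def consistency_loss(logits_paired: torch.Tensor,
                     logits_mismatched: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between two [batch, seq_len, vocab] logit tensors:
    one obtained with the paired image, one with an irrelevant image."""
    log_p = F.log_softmax(logits_paired, dim=-1)
    log_q = F.log_softmax(logits_mismatched, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Usage with random stand-in logits (batch=2, length=7, vocab=100):
logits_a = torch.randn(2, 7, 100)   # decoder logits with the paired image
logits_b = torch.randn(2, 7, 100)   # decoder logits with an irrelevant image
print(consistency_loss(logits_a, logits_b))
```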
VQA-Augmented Machine Translation with Cross-Modal Contrastive Learning
Zhihui Zhang | Shiliang Sun | Jing Zhao | Tengfei Song | Hao Yang
Findings of the Association for Computational Linguistics: EMNLP 2025
Multimodal machine translation (MMT) aims to enhance translation quality by integrating visual information. However, existing methods often extract visual features using pre-trained models while learning text features from scratch, leading to representation imbalance. These methods are also prone to being misled by redundant visual information, which results in suboptimal performance. To address these challenges, we propose CAMT, a novel cross-modal VQA-augmented MMT method. CAMT aligns image-source text pairs and image-question text pairs through dual-text contrastive learning, thereby improving semantic consistency across modalities. Additionally, we design an effective strategy for generating question–answer pairs to enhance fine-grained alignment and filter out irrelevant visual noise, while also addressing the scarcity of VQA annotations. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the proposed CAMT framework, which consistently outperforms state-of-the-art MMT methods across multiple evaluation metrics.
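The dual-text contrastive component described above can be pictured with a short sketch. The snippet below applies a symmetric InfoNCE loss to image–source and image–question embedding pairs; the embedding names, temperature, and weighting are illustrative assumptions, not CAMT's actual implementation.

```python
# Hypothetical sketch of dual-text contrastive learning with pre-computed embeddings.
import torch
import torch.nn.functional as F

def info_nce(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature              # [batch, batch] similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dual_text_contrastive(img_emb, src_emb, question_emb, alpha: float = 0.5):
    """Combine image-source and image-question contrastive terms."""
    return (alpha * info_nce(img_emb, src_emb) +
            (1 - alpha) * info_nce(img_emb, question_emb))

# Usage with random stand-in embeddings (batch=4, dim=256):
img, src, q = (torch.randn(4, 256) for _ in range(3))
print(dual_text_contrastive(img, src, q))
```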
Imagination and Contemplation: A Balanced Framework for Semantic-Augmented Multimodal Machine Translation
Zhuang Yu | Shiliang Sun | Jing Zhao | Tengfei Song | Hao Yang
Findings of the Association for Computational Linguistics: EMNLP 2025
Multimodal Machine Translation (MMT) enhances textual translation through auxiliary inputs such as images, which is particularly effective in resolving linguistic ambiguities. However, visual information often introduces redundancy or noise, potentially impairing translation quality. To address this challenge, we propose a balanced semantic-augmented framework that integrates “Imagination” and “Contemplation” in multimodal understanding. Specifically, we first generate synthetic images from the source text and align them with the authentic images via an optimal transport (OT) loss to enhance visual-semantic consistency. A CLIP-based similarity gating mechanism is introduced to adaptively fuse visual features from both authentic and synthetic images during visual representation learning. To strengthen semantic grounding, a neural machine translation (NMT) branch is incorporated as a regularization signal, and a Kullback-Leibler (KL) divergence is applied between MMT and NMT outputs to mitigate modality mismatch. Furthermore, an image-text contrastive (ITC) loss aligns the final translations with image representations, reinforcing multimodal coherence. Experiments on multiple translation datasets with a diverse set of language pairs demonstrate that our framework outperforms existing baselines, particularly in cases with visually ambiguous or weakly correlated content.
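Two of the components named in this abstract, similarity-based gating over authentic and synthetic visual features and the KL regularization between MMT and NMT outputs, can be sketched in a few lines. The snippet below is a minimal illustration under assumed shapes and a simple softmax gate; it is not the paper's implementation.

```python
# Hypothetical sketch: similarity-gated visual fusion plus an MMT/NMT KL term.
import torch
import torch.nn.functional as F

def gated_visual_fusion(feat_real, feat_synth, sim_real, sim_synth):
    """Weight authentic vs. synthetic visual features ([batch, dim]) by their
    text-image similarity scores ([batch]), here via a softmax gate."""
    gate = torch.softmax(torch.stack([sim_real, sim_synth], dim=-1), dim=-1)
    return gate[..., :1] * feat_real + gate[..., 1:] * feat_synth

def mmt_nmt_kl(mmt_logits, nmt_logits):
    """KL(NMT || MMT) over the vocabulary, used as a regularization signal."""
    return F.kl_div(F.log_softmax(mmt_logits, dim=-1),
                    F.softmax(nmt_logits, dim=-1),
                    reduction="batchmean")

# Usage with random stand-ins:
feat_r, feat_s = torch.randn(2, 512), torch.randn(2, 512)
sim_r, sim_s = torch.rand(2), torch.rand(2)
fused = gated_visual_fusion(feat_r, feat_s, sim_r, sim_s)
kl = mmt_nmt_kl(torch.randn(2, 7, 100), torch.randn(2, 7, 100))
print(fused.shape, kl.item())
```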
Co-authors
- Shiliang Sun 3
- Hao Yang (杨浩) 3
- Jing Zhao 3
- Yue Gao 1
- Xiaosong Qiao 1