Video-guided Machine Translation: A Survey of Models, Datasets, and Challenges

Pinaki Das, Virendra Singh, Pushpak Bhattacharyya, Gholamreza Haffari


Abstract
In recent years, machine translation has evolved with the integration of multimodal information. Infusion of multi-modality into translation tasks decreases ambiguation and enhances translation scores. Common modalities include images, speech, and videos, which provide additional context alongside the text to be translated. While multimodal translation with images has been extensively studied, video-guided machine translation (VMT) has gained increasing attention, particularly since Wang et al. 2019 first explored this task. In this paper, we provide a comprehensive overview of VMT, highlighting its unique challenges, methodologies, and recent advancements. Unlike previous surveys that primarily focus on image-guided multimodal machine translation, this work explores the distinct complexities and opportunities introduced by adding video as a modality to the translation task.
Anthology ID:
2025.ijcnlp-long.178
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:
IJCNLP | AACL
SIG:
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Note:
Pages:
3346–3356
Language:
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.178/
DOI:
Bibkey:
Cite (ACL):
Pinaki Das, Virendra Singh, Pushpak Bhattacharyya, and Gholamreza Haffari. 2025. Video-guided Machine Translation: A Survey of Models, Datasets, and Challenges. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 3346–3356, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
Video-guided Machine Translation: A Survey of Models, Datasets, and Challenges (Das et al., IJCNLP-AACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.178.pdf