ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

Yichen Lu; Wei Dai; Jiaen Liu; Ching Wing Kwok; Zongheng Wu; Xudong Xiao; Ao Sun; Sheng Fu; Jianyuan Zhan; Yian Wang; Takatomo Saito; Sicheng Lai

doi:10.18653/v1/2025.emnlp-demos.17

Abstract

LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our demo is available here: https://vidove.willbe03.com/

Anthology ID: 2025.emnlp-demos.17
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2025
Address: Suzhou, China
Editors: Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 228–243
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-demos.17/
DOI: 10.18653/v1/2025.emnlp-demos.17
Cite (ACL): Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, and Sicheng Lai. 2025. ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 228–243, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning (Lu et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-demos.17.pdf
