ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, Sicheng Lai
Abstract
LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our demo is available here: https://vidove.willbe03.com/- Anthology ID:
- 2025.emnlp-demos.17
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Ivan Habernal, Peter Schulam, Jörg Tiedemann
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 228–243
- Language:
- URL:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-demos.17/
- DOI:
- 10.18653/v1/2025.emnlp-demos.17
- Cite (ACL):
- Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, and Sicheng Lai. 2025. ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 228–243, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning (Lu et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-demos.17.pdf