@inproceedings{won-etal-2025-end,
  title     = {End-to-End Multilingual Automatic Dubbing via Duration-based Translation with Large Language Models},
  author    = {Won, Hyun-Sik and
               Jeong, DongJin and
               Choi, Hyunkyu and
               Kim, Jinwon},
  editor    = {Habernal, Ivan and
               Schulam, Peter and
               Tiedemann, J{\"o}rg},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  month     = nov,
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.37/},
  pages     = {515--521},
  isbn      = {979-8-89176-334-0},
  abstract  = {Automatic dubbing (AD) aims to replace the original speech in a video with translated speech that maintains precise temporal alignment (isochrony). Achieving natural synchronization between dubbed speech and visual content remains challenging due to variations in speech durations across languages. To address this, we propose an end-to-end AD framework that leverages large language models (LLMs) to integrate translation and timing control seamlessly. At the core of our framework lies Duration-based Translation (DT), a method that dynamically predicts the optimal phoneme count based on source speech duration and iteratively adjusts the translation length accordingly. Our experiments on English, Spanish, and Korean language pairs demonstrate that our approach substantially improves speech overlap{---}achieving up to 24{\%} relative gains compared to translations without explicit length constraints{---}while maintaining competitive translation quality measured by COMET scores. Furthermore, our framework does not require language-specific tuning, ensuring practicality for multilingual dubbing scenarios.},
}
@comment{Markdown (Informal)}
@comment{
  [End-to-End Multilingual Automatic Dubbing via Duration-based Translation with Large Language Models](https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-demos.37/) (Won et al., EMNLP 2025)
  ACL
}