Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Jeongsoo Choi; Jaehun Kim; Joon Son Chung

doi:10.18653/v1/2025.findings-emnlp.524

Abstract

This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the translated units and source speaker’s identity using a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech’s duration and speaking pace, while achieving competitive translation performance.

Anthology ID: 2025.findings-emnlp.524
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9871–9881
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.524/
DOI: 10.18653/v1/2025.findings-emnlp.524
Cite (ACL): Jeongsoo Choi, Jaehun Kim, and Joon Son Chung. 2025. Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9871–9881, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing (Choi et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.524.pdf
Checklist: 2025.findings-emnlp.524.checklist.pdf