Luis Tavarez-Arce
2025
Whisper-UT: A Unified Translation Framework for Speech and Text
Cihan Xiao | Matthew Wiesner | Debashish Chakraborty | Reno Kriz | Keith Cunningham | Kenton Murray | Kevin Duh | Luis Tavarez-Arce | Paul McNamee | Sanjeev Khudanpur
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a 2-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.
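The 2-stage decoding idea described in the abstract can be illustrated with a minimal sketch: stage 1 produces an ASR hypothesis, and stage 2 conditions the translation on that hypothesis as a decoder prompt. This is not the paper's Whisper-UT code (which adds lightweight adapters); it uses the stock Hugging Face Whisper API, and the model name, source language, and 16 kHz mono waveform input are assumptions for illustration.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Assumed checkpoint; Whisper-UT itself fine-tunes adapters on top of a Whisper backbone.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def two_stage_translate(waveform, sampling_rate=16000, source_lang="de"):
    feats = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features

    # Stage 1: transcribe in the source language to obtain an ASR hypothesis.
    asr_ids = model.generate(feats, language=source_lang, task="transcribe")
    hypothesis = processor.batch_decode(asr_ids, skip_special_tokens=True)[0]

    # Stage 2: translate to English, prompting the decoder with the stage-1 hypothesis
    # (a ground-truth transcript could be substituted here when one is available).
    prompt_ids = processor.get_prompt_ids(hypothesis, return_tensors="pt")
    st_ids = model.generate(feats, language=source_lang, task="translate", prompt_ids=prompt_ids)
    # Depending on the transformers version, the decoded string may still contain the
    # prompt text at the front and need to be stripped.
    translation = processor.batch_decode(st_ids, skip_special_tokens=True)[0]
    return hypothesis, translation
```

In the multi-modal setting, the same prompting mechanism lets the model attend to source text and speech together rather than running two separate passes.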
2024
Can Synthetic Speech Improve End-to-End Conversational Speech Translation?
Bismarck Bamfo Odoom | Nathaniel Robinson | Elijah Rippeth | Luis Tavarez-Arce | Kenton Murray | Matthew Wiesner | Paul McNamee | Philipp Koehn | Kevin Duh
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Conversational speech translation is an important technology that fosters communication among people of different language backgrounds. Three-way parallel data in the form of source speech, source transcript, and target translation is usually required to train end-to-end systems. However, such datasets are not readily available and are expensive to create as this involves multiple annotation stages. In this paper, we investigate the use of synthetic data from generative models, namely machine translation and text-to-speech synthesis, for training conversational speech translation systems. We show that adding synthetic data to the training recipe increasingly improves end-to-end training performance, especially when limited real data is available. However, when no real data is available, no amount of synthetic data helps.
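A hedged sketch of the data-augmentation idea, not the paper's exact recipe: missing sides of the 3-way (speech, transcript, translation) triple are filled in with generative models. The MT step below uses a Hugging Face translation pipeline and an assumed Spanish-to-English checkpoint; `synthesize_speech` is a hypothetical stand-in for whatever TTS system is used.

```python
from transformers import pipeline

# One possible MT choice; any source-to-target translation model would do.
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def synthesize_speech(text: str):
    """Hypothetical stand-in for a TTS system returning a source-language waveform."""
    raise NotImplementedError("plug in any text-to-speech model here")

def build_synthetic_triples(transcripts):
    """Turn source-language transcripts into synthetic (speech, transcript, translation) triples."""
    triples = []
    for transcript in transcripts:
        translation = mt(transcript)[0]["translation_text"]  # synthetic target side via MT
        speech = synthesize_speech(transcript)                # synthetic source speech via TTS
        triples.append({"speech": speech, "transcript": transcript, "translation": translation})
    return triples
```

These synthetic triples are then mixed with whatever real 3-way data exists; per the abstract, the gains are largest when real data is scarce and vanish when there is none.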