Luis Tavarez-Arce


2024

Can Synthetic Speech Improve End-to-End Conversational Speech Translation?
Bismarck Bamfo Odoom | Nathaniel Robinson | Elijah Rippeth | Luis Tavarez-Arce | Kenton Murray | Matthew Wiesner | Paul McNamee | Philipp Koehn | Kevin Duh
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Conversational speech translation is an important technology that fosters communication among people of different language backgrounds. Three-way parallel data, in the form of source speech, source transcript, and target translation, is usually required to train end-to-end systems. However, such datasets are not readily available and are expensive to create, as their construction involves multiple annotation stages. In this paper, we investigate the use of synthetic data from generative models, namely machine translation and text-to-speech synthesis, for training conversational speech translation systems. We show that adding synthetic data to the training recipe yields increasing improvements in end-to-end performance, especially when real data is limited. However, when no real data is available at all, no amount of synthetic data helps.
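The pipeline the abstract describes can be sketched at a high level: starting from source-language transcripts, a machine translation model supplies target translations and a text-to-speech model supplies source speech, producing the three-way parallel (speech, transcript, translation) triples needed for end-to-end training. The sketch below is a minimal illustration with hypothetical stand-in functions (`translate`, `synthesize_speech`), not the authors' actual models or code.

```python
def translate(text: str) -> str:
    # Hypothetical MT stand-in; a real system would call an NMT model.
    return f"<translation of: {text}>"

def synthesize_speech(text: str) -> list:
    # Hypothetical TTS stand-in; a real system would return a waveform.
    return [0.0] * len(text)  # dummy audio samples

def make_synthetic_triples(transcripts):
    """Build (source_speech, source_transcript, target_translation) triples
    from source-language transcripts alone, via TTS and MT."""
    triples = []
    for transcript in transcripts:
        speech = synthesize_speech(transcript)   # synthetic source audio
        translation = translate(transcript)      # synthetic target text
        triples.append((speech, transcript, translation))
    return triples

synthetic = make_synthetic_triples(["hello there", "how are you"])
# In practice, these synthetic triples would be mixed with whatever
# real three-way parallel data is available before training.
```

Per the abstract's finding, such synthetic triples help only as a supplement to real data; a training set built purely from them does not.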