Training Simultaneous Speech Translation with Robust and Random Wait-k-Tokens Strategy

Linlin Zhang, Kai Fan, Jiajun Bu, Zhongqiang Huang


Abstract
Simultaneous Speech Translation (SimulST) is a task focused on ensuring high-quality translation of speech in low-latency situations. However, the modality gap (e.g., unknown word boundaries) between audio and text presents a challenge: it hinders the effective application of policies from simultaneous text translation (SimulMT) and compromises the performance of offline speech translation. To address this issue, we first leverage the Montreal Forced Aligner (MFA) with audio-transcription pairs to pre-train the acoustic encoder, and introduce a token-level cross-modal alignment that allows the wait-k policy from SimulMT to better adapt to SimulST. This token-level boundary alignment simplifies the decision-making process for predicting read/write actions, as if the decoder were directly processing text tokens. Subsequently, to optimize the SimulST task, we propose a robust and random wait-k-tokens strategy. This strategy allows a single model to meet various latency requirements and minimizes error accumulation of boundary alignment during inference. Our experiments on the MuST-C dataset show that our method achieves a better trade-off between translation quality and latency.
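The abstract builds on the standard wait-k policy from SimulMT, which first reads k source tokens and then alternates one write per read until the source is exhausted. A minimal sketch of that action schedule (not the paper's actual implementation, which operates on aligned audio-token boundaries) might look like:

```python
def wait_k_actions(k: int, src_len: int, tgt_len: int) -> list[str]:
    """Generate the READ/WRITE schedule of a basic wait-k policy:
    read k source tokens first, then emit one target token per
    additional source token, flushing remaining writes at the end."""
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        # Read until k tokens ahead of the writes, or the source runs out.
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# Example: k=2 with 4 source and 4 target tokens lags two tokens behind.
print(wait_k_actions(2, 4, 4))
```

The paper's contribution is to make this schedule applicable to speech by aligning audio frames to token boundaries (via MFA), so the "READ" steps count tokens rather than raw frames, and to randomize k during training so one model serves multiple latency budgets.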
Anthology ID:
2023.emnlp-main.484
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7814–7831
URL:
https://aclanthology.org/2023.emnlp-main.484
DOI:
10.18653/v1/2023.emnlp-main.484
Cite (ACL):
Linlin Zhang, Kai Fan, Jiajun Bu, and Zhongqiang Huang. 2023. Training Simultaneous Speech Translation with Robust and Random Wait-k-Tokens Strategy. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7814–7831, Singapore. Association for Computational Linguistics.
Cite (Informal):
Training Simultaneous Speech Translation with Robust and Random Wait-k-Tokens Strategy (Zhang et al., EMNLP 2023)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2023.emnlp-main.484.pdf
Video:
https://preview.aclanthology.org/naacl-24-ws-corrections/2023.emnlp-main.484.mp4