CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness

Brian Yan, Patrick Fernandes, Jinchuan Tian, Siqi Ouyang, William Chen, Karen Livescu, Lei Li, Graham Neubig, Shinji Watanabe


Abstract
This work describes CMU’s submission to the IWSLT 2024 Offline Speech Translation (ST) Shared Task for translating English speech to German, Chinese, and Japanese text. We are the first participants to employ a long-form strategy that directly processes unsegmented recordings without the need for a separate voice activity detection (VAD) stage. We show that the Whisper automatic speech recognition (ASR) model has a hallucination problem when applied out-of-the-box to recordings containing non-speech noises, but that a simple noisy fine-tuning approach can greatly enhance Whisper’s long-form robustness across multiple domains. We then feed the English ASR outputs into fine-tuned NLLB machine translation (MT) models, which are decoded using COMET-based Minimum Bayes Risk. Our VAD-free ASR+MT cascade is tested on TED talks, TV series, and workout videos and is shown to outperform prior winning IWSLT submissions and large open-source models.
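To illustrate the Minimum Bayes Risk (MBR) decoding step mentioned in the abstract, here is a minimal sketch of the general technique: among a set of candidate translations, pick the one with the highest average utility against all the others. The abstract states that the utility is COMET-based; this sketch swaps in a toy unigram F1 overlap so that it runs without any model download. The function names `unigram_f1` and `mbr_select` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: unigram F1 overlap (an assumed stand-in for COMET)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # Count reference tokens so repeated words are matched at most once each.
    ref_counts: dict = {}
    for tok in ref_tokens:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    overlap = 0
    for tok in hyp_tokens:
        if ref_counts.get(tok, 0) > 0:
            overlap += 1
            ref_counts[tok] -= 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # Every other candidate acts as a pseudo-reference for this hypothesis.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
print(mbr_select(candidates))
```

With a learned metric, `utility` would be replaced by a function that scores a hypothesis against a pseudo-reference (and typically the source sentence as well); the selection loop itself stays the same.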
Anthology ID:
2024.iwslt-1.22
Volume:
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand (in-person and online)
Editors:
Elizabeth Salesky, Marcello Federico, Marine Carpuat
Venue:
IWSLT
Publisher:
Association for Computational Linguistics
Pages:
164–169
URL:
https://aclanthology.org/2024.iwslt-1.22
DOI:
10.18653/v1/2024.iwslt-1.22
Cite (ACL):
Brian Yan, Patrick Fernandes, Jinchuan Tian, Siqi Ouyang, William Chen, Karen Livescu, Lei Li, Graham Neubig, and Shinji Watanabe. 2024. CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), pages 164–169, Bangkok, Thailand (in-person and online). Association for Computational Linguistics.
Cite (Informal):
CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness (Yan et al., IWSLT 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.iwslt-1.22.pdf