2025
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Amir Hussein | Cihan Xiao | Matthew Wiesner | Dan Povey | Leibny Paola Garcia Perera | Sanjeev Khudanpur
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Neural transducers (NT) provide an effective framework for streaming speech processing and have demonstrated strong performance in automatic speech recognition (ASR). However, applying NT to speech translation (ST) remains challenging: existing approaches struggle with word reordering and degrade when jointly modeling ASR and ST, leaving a gap with attention-based encoder-decoder (AED) models. Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes the ASR and translation tasks to better handle reordering. To ensure robust ST while preserving ASR performance, we use self-distillation with CTC consistency regularization. Moreover, we improve computational efficiency by incorporating best practices from ASR transducers, including a down-sampled hierarchical encoder, a stateless predictor, and a pruned transducer loss to reduce training complexity. Finally, we introduce a blank penalty during decoding, which reduces deletions and improves translation quality. Our approach is evaluated on three conversational datasets (Arabic, Spanish, and Mandarin), achieving new state-of-the-art performance among NT models and substantially narrowing the gap with AED-based systems.
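The blank penalty mentioned in the abstract is, in essence, a constant subtracted from the blank symbol's log-probability at each decoding step, nudging the model to emit output tokens rather than skip frames. The sketch below illustrates the idea; the function name, tensor shapes, and penalty value are illustrative assumptions, not the paper's implementation.

```python
import torch

def apply_blank_penalty(log_probs: torch.Tensor, blank_id: int, penalty: float) -> torch.Tensor:
    """Subtract a constant penalty from the blank token's log-probability.

    log_probs: (..., vocab_size) joiner output after log-softmax.
    Lowering the blank score discourages skipping output symbols,
    which reduces deletions in the translation.
    """
    out = log_probs.clone()
    out[..., blank_id] -= penalty
    return out

# Hypothetical usage inside a greedy transducer decoding loop:
# logits = joiner(encoder_frame, predictor_state)            # (vocab,)
# log_probs = apply_blank_penalty(logits.log_softmax(-1), blank_id=0, penalty=1.5)
# next_token = log_probs.argmax(-1)
```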
JHU IWSLT 2025 Low-resource System Description
Nathaniel Romney Robinson | Niyati Bafna | Xiluo He | Tom Lupicki | Lavanya Shankar | Cihan Xiao | Qi Sun | Kenton Murray | David Yarowsky
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
We present the Johns Hopkins University’s submission to the 2025 IWSLT Low-Resource Task. We competed on all 10 language pairs. Our approach centers on ensembling methods, specifically Minimum Bayes Risk (MBR) decoding. We find that such ensembling often improves performance only marginally over the best-performing stand-alone model, and in some cases can even hurt performance slightly.
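As a rough illustration of MBR decoding over an ensemble: each candidate translation is scored by its expected similarity to all other candidates, and the candidate with the highest expected utility is selected. The sketch below uses sentence-level BLEU from sacrebleu as the utility; the submission's exact utility function and candidate pooling are not specified here, so treat this as an assumed, minimal variant.

```python
from sacrebleu import sentence_bleu

def mbr_decode(hypotheses: list[str]) -> str:
    """Pick the hypothesis with the highest average similarity
    (here, sentence BLEU) to the rest of the ensemble pool."""
    best, best_score = hypotheses[0], float("-inf")
    for cand in hypotheses:
        # Expected utility of `cand`, with the pool acting as pseudo-references.
        score = sum(sentence_bleu(cand, [other]).score
                    for other in hypotheses if other is not cand)
        if score > best_score:
            best, best_score = cand, score
    return best

# e.g. outputs of several ST models for the same utterance:
print(mbr_decode([
    "the meeting starts at nine",
    "the meeting begins at nine",
    "meetings start at nine o'clock",
]))
```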
2024
JHU IWSLT 2024 Dialectal and Low-resource System Description
Nathaniel Romney Robinson | Kaiser Sun | Cihan Xiao | Niyati Bafna | Weiting Tan | Haoran Xu | Henry Li Xinyuan | Ankur Kejriwal | Sanjeev Khudanpur | Kenton Murray | Paul McNamee
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
Johns Hopkins University (JHU) submitted systems for all eight language pairs in the 2024 Low-Resource Language Track. The main effort of this work revolves around fine-tuning large, publicly available models in three proposed systems: i) end-to-end speech translation (ST) fine-tuning of SeamlessM4T v2; ii) ST fine-tuning of Whisper; iii) a cascaded system combining automatic speech recognition with fine-tuned Whisper and machine translation with NLLB. On top of the systems above, we conduct a comparative analysis of different training paradigms, such as intra-distillation for NLLB as well as joint training and curriculum learning for SeamlessM4T v2. Our results show that the best-performing approach differs by language pair, but that i) fine-tuned SeamlessM4T v2 tends to perform best for source languages on which it was pre-trained, ii) multi-task training helps Whisper fine-tuning, iii) cascaded systems with Whisper and NLLB tend to outperform Whisper alone, and iv) intra-distillation helps NLLB fine-tuning.
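System iii) is a standard ASR-then-MT cascade. A minimal sketch of such a cascade with Hugging Face pipelines appears below; the specific checkpoints and language codes are illustrative placeholders, whereas the actual submission used fine-tuned Whisper and NLLB models.

```python
from transformers import pipeline

# ASR stage: an illustrative (not the submission's) Whisper checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# MT stage: NLLB with explicit source/target language codes.
mt = pipeline("translation", model="facebook/nllb-200-distilled-600M",
              src_lang="arb_Arab", tgt_lang="eng_Latn")

def cascade_translate(audio_path: str) -> str:
    """Transcribe the audio, then translate the transcript."""
    transcript = asr(audio_path)["text"]
    return mt(transcript)[0]["translation_text"]

print(cascade_translate("utterance.wav"))
```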
2023
JHU IWSLT 2023 Dialect Speech Translation System Description
Amir Hussein | Cihan Xiao | Neha Verma | Thomas Thebaud | Matthew Wiesner | Sanjeev Khudanpur
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper presents JHU’s submissions to the IWSLT 2023 dialectal and low-resource track on Tunisian Arabic to English speech translation. The Tunisian dialect lacks formal orthography and abundant training data, making it challenging to develop effective speech translation (ST) systems. To address these challenges, we explore the integration of large pre-trained machine translation (MT) models, such as mBART and NLLB-200, in both end-to-end (E2E) and cascaded ST systems. We also improve automatic speech recognition (ASR) performance through pseudo-labeling data augmentation and channel matching on telephone data. Finally, we combine our E2E and cascaded ST systems with Minimum Bayes-Risk decoding. Our combined system achieves BLEU scores of 21.6 and 19.1 on test2 and test3, respectively.
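One common way to realize the channel matching mentioned above is to downsample wide-band training audio to the 8 kHz telephone rate so that augmentation data resembles conversational telephone speech. The torchaudio-based sketch below shows this under that assumption; the paper's exact augmentation pipeline may differ.

```python
import torchaudio
import torchaudio.functional as F

def match_telephone_channel(path: str) -> tuple:
    """Load wide-band audio and downsample it to the 8 kHz
    telephone sampling rate used by conversational telephone data."""
    waveform, sr = torchaudio.load(path)
    telephone = F.resample(waveform, orig_freq=sr, new_freq=8000)
    return telephone, 8000
```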