Soumi Maiti


2024

Towards Robust Speech Representation Learning for Thousands of Languages
William Chen | Wangyou Zhang | Yifan Peng | Xinjian Li | Jinchuan Tian | Jiatong Shi | Xuankai Chang | Soumi Maiti | Karen Livescu | Shinji Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world’s 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/.

2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
Brian Yan | Jiatong Shi | Yun Tang | Hirofumi Inaguma | Yifan Peng | Siddharth Dalmia | Peter Polák | Patrick Fernandes | Dan Berrebbi | Tomoki Hayashi | Xiaohui Zhang | Zhaoheng Ni | Moto Hira | Soumi Maiti | Juan Pino | Shinji Watanabe
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST). Each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open-source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.

CMU’s IWSLT 2023 Simultaneous Speech Translation System
Brian Yan | Jiatong Shi | Soumi Maiti | William Chen | Xinjian Li | Yifan Peng | Siddhant Arora | Shinji Watanabe
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper describes CMU’s submission to the IWSLT 2023 simultaneous speech translation shared task for translating English speech to both German text and speech in a streaming fashion. We first build offline speech-to-text (ST) models using the joint CTC/attention framework. These models also use WavLM front-end features and mBART decoder initialization. We adapt our offline ST models for simultaneous speech-to-text translation (SST) by 1) incrementally encoding chunks of input speech, re-computing encoder states for each new chunk and 2) incrementally decoding output text, pruning beam search hypotheses to 1-best after processing each chunk. We then build text-to-speech (TTS) models using the VITS framework and achieve simultaneous speech-to-speech translation (SS2ST) by cascading our SST and TTS models.
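
The chunk-wise adaptation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the ESPnet API: `simultaneous_decode`, `toy_encode`, and `toy_extend_beam` are hypothetical names, and the toy encoder and beam search are stand-ins for real models, assumed here only to make the control flow runnable. The loop re-encodes all speech received so far for each new chunk, runs a beam search that continues from the committed text, and then prunes the beam to its 1-best hypothesis.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[List[str], float]  # (output tokens, score)


def simultaneous_decode(
    speech: Sequence[int],
    chunk_size: int,
    encode: Callable[[Sequence[int]], Sequence[int]],
    extend_beam: Callable[[Sequence[int], List[str], int], List[Hypothesis]],
    beam_size: int = 4,
) -> List[str]:
    """Chunk-wise decoding loop (illustrative sketch, hypothetical interface)."""
    committed: List[str] = []
    # Step through the input one chunk at a time; the last chunk may be short.
    for end in range(chunk_size, len(speech) + chunk_size, chunk_size):
        prefix = speech[:end]       # all speech received so far
        states = encode(prefix)     # re-compute encoder states for the new chunk
        beam = extend_beam(states, committed, beam_size)
        # Prune the beam to the 1-best hypothesis before reading more input.
        committed, _ = max(beam, key=lambda hyp: hyp[1])
    return committed


def toy_encode(frames: Sequence[int]) -> Sequence[int]:
    """Stand-in encoder (assumption): returns the frames unchanged."""
    return list(frames)


def toy_extend_beam(
    states: Sequence[int], committed: List[str], beam_size: int
) -> List[Hypothesis]:
    """Stand-in beam search (assumption): one output token per two frames."""
    n_target = len(states) // 2
    n_new = n_target - len(committed)
    good = committed + [f"w{i}" for i in range(len(committed), n_target)]
    bad = committed + ["<unk>"] * n_new
    return [(bad, -1.0), (good, 0.0)][:beam_size]


result = simultaneous_decode(list(range(6)), 2, toy_encode, toy_extend_beam)
print(result)  # prints ['w0', 'w1', 'w2']
```

Because each chunk's search starts from the single committed hypothesis, text that has already been emitted is never revised, which is what allows the output to be streamed as it is produced.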