Lirong Dai


2022

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
Ziqiang Zhang | Long Zhou | Junyi Ao | Shujie Liu | Lirong Dai | Jinyu Li | Furu Wei
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden units as an interface to align speech and text, we decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data, respectively. SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT achieves substantial improvements over strong baselines and state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. We also conduct detailed analyses to better understand SpeechUT. The code and pre-trained models are available at https://aka.ms/SpeechUT.

The USTC-NELSLIP Offline Speech Translation Systems for IWSLT 2022
Weitai Zhang | Zhongyi Ye | Haitao Tang | Xiaoxi Li | Xinyuan Zhou | Jing Yang | Jianwei Cui | Pan Deng | Mohan Shi | Yifan Song | Dan Liu | Junhua Liu | Lirong Dai
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper describes USTC-NELSLIP’s submissions to the IWSLT 2022 Offline Speech Translation task, covering speech translation of talks from English to German, English to Chinese, and English to Japanese. We describe both cascaded architectures and end-to-end models that directly translate source speech into target text. In the cascaded condition, we investigate the effectiveness of different model architectures with robust training and achieve an improvement of 2.72 BLEU over last year’s best system on the MuST-C English-German test set. In the end-to-end condition, we build models based on Transformer and Conformer architectures, achieving an improvement of 2.26 BLEU over last year’s best end-to-end system. The end-to-end system obtains promising results but still lags behind our cascaded models.

2021

The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021
Dan Liu | Mengge Du | Xiaoxi Li | Yuchen Hu | Lirong Dai
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

This paper describes USTC-NELSLIP’s submissions to the IWSLT 2021 Simultaneous Speech Translation task. We propose a novel simultaneous translation model, the Cross-Attention Augmented Transducer (CAAT), which extends the conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, such as simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show that CAAT achieves better quality-latency trade-offs than wait-k, one of the previous state-of-the-art approaches. Based on the CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems for this evaluation campaign. Compared to last year’s best systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU across all latency regimes, and our T2T simultaneous translation system improves by an average of 4.6 BLEU.

2014

The USTC machine translation system for IWSLT 2014
Shijin Wang | Yuguang Wang | Jianfeng Li | Yiming Cui | Lirong Dai
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign