2024
Revisiting Interpolation Augmentation for Speech-to-Text Generation
Chen Xu | Jie Wang | Xiaoqian Liu | Qian Dong | Chunliang Zhang | Tong Xiao | JingBo Zhu | Dapeng Man | Wu Yang
Findings of the Association for Computational Linguistics: ACL 2024
Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique’s application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
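The idea of constructing virtual training samples by interpolating inputs and labels can be sketched as a generic mixup-style operation. The snippet below is a minimal illustration under stated assumptions: the `mixup` function name and the Beta(`alpha`, `alpha`) sampling are conventions from the broader mixup literature, not the paper's specific S2T interpolation strategy (e.g., where in the network the interpolation is applied).

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, lam=None, rng=None):
    """Interpolate two training examples (inputs and soft labels).

    A generic mixup-style sketch: lam is drawn from Beta(alpha, alpha)
    unless supplied explicitly. The paper's exact interpolation strategy
    for S2T may differ from this input-space version.
    """
    if lam is None:
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2  # interpolated input features
    y = lam * y1 + (1.0 - lam) * y2  # interpolated (soft) labels
    return x, y, lam
```

With `lam=0.5`, the virtual sample is the midpoint of the two inputs and of the two label distributions.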
2023
The NiuTrans End-to-End Speech Translation System for IWSLT23 English-to-Chinese Offline Task
Yuchen Han | Xiaoqian Liu | Hao Chen | Yuhao Zhang | Chen Xu | Tong Xiao | Jingbo Zhu
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes the NiuTrans end-to-end speech translation system submitted to the IWSLT 2023 English-to-Chinese offline task. Our speech translation models are composed of pre-trained ASR and MT models under the SATE framework. Several pre-trained models with diverse architectures and input representations (e.g., log Mel-filterbank and waveform) were utilized. We proposed an IDA method to iteratively improve the performance of the MT models and to generate pseudo ST data with the MT systems. We then trained ST models with different structures and data settings to enhance ensemble performance. Experimental results demonstrate that our NiuTrans system achieved a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set, outperforming the previous year's submission by 0.12 BLEU despite using less MT training data.
Bridging the Granularity Gap for Acoustic Modeling
Chen Xu | Yuhao Zhang | Chengbo Jiao | Xiaoqian Liu | Chi Hu | Xin Zeng | Tong Xiao | Anxiang Ma | Huizhen Wang | Jingbo Zhu
Findings of the Association for Computational Linguistics: ACL 2023
While the Transformer has become the de-facto standard for speech processing, modeling fine-grained frame-level features remains an open challenge: it is difficult to capture long-distance dependencies and to distribute the attention weights over long sequences. We propose Progressive Down-Sampling (PDS), which gradually compresses the acoustic features into coarser-grained units that contain more complete semantic information, akin to text-level representations. In addition, we develop a representation fusion method to alleviate the information loss that inevitably occurs under high compression. In this way, we compress the acoustic features to 1/32 of the initial length while achieving better or comparable performance on the speech recognition task. As a bonus, it yields inference speedups ranging from 1.20x to 1.47x. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.
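As a rough illustration of the compression schedule, the sketch below pools frame-level features in stages whose factors multiply to 32, matching the 1/32 length reduction described above. It is a stand-in only: the actual PDS modules use learned down-sampling layers and a representation fusion step rather than mean-pooling, and the function name and stage factors here are assumptions.

```python
import numpy as np

def progressive_downsample(feats, factors=(2, 2, 2, 4)):
    """Progressively compress (T, D) acoustic frames into coarser units.

    Each stage pools neighbouring frames by its factor; the product of
    `factors` (here 32) gives the overall compression ratio. Mean-pooling
    stands in for the learned down-sampling of the real PDS modules.
    """
    for f in factors:
        T = feats.shape[0] - feats.shape[0] % f  # drop remainder frames
        feats = feats[:T].reshape(T // f, f, -1).mean(axis=1)
    return feats
```

For example, 64 input frames come out as 2 coarse-grained units after the four stages.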
CTC-based Non-autoregressive Speech Translation
Chen Xu | Xiaoqian Liu | Xiaowen Liu | Qingxuan Sun | Yuhao Zhang | Murun Yang | Qianqian Dong | Tom Ko | Mingxuan Wang | Tong Xiao | Anxiang Ma | Jingbo Zhu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST). In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides poses obvious challenges: 1) the conditionally independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve training convergence. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×, which is comparable to the autoregressive counterpart and even outperforms the previous best result by 0.9 BLEU points.
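The per-frame CTC predictions that guide the two encoders are mapped to token sequences by the standard CTC collapse rule (merge consecutive repeats, then drop blanks). A minimal sketch of that rule, assuming token IDs with `0` as the blank:

```python
def ctc_collapse(ids, blank=0):
    """Collapse a CTC alignment: merge consecutive repeats, drop blanks.

    This is the standard CTC decoding rule; a blank between two identical
    tokens keeps them as separate output tokens.
    """
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

For instance, the frame-level prediction `[0, 3, 3, 0, 3, 5, 5, 0]` collapses to the token sequence `[3, 3, 5]`.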
2022
The NiuTrans’s Submission to the IWSLT22 English-to-Chinese Offline Speech Translation Task
Yuhao Zhang | Canan Huang | Chen Xu | Xiaoqian Liu | Bei Li | Anxiang Ma | Tong Xiao | Jingbo Zhu
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
This paper describes NiuTrans's submission to the IWSLT22 English-to-Chinese (En-Zh) offline speech translation task. The end-to-end bilingual system is built with constrained English and Chinese data and translates the English speech into Chinese text without intermediate transcription. Our speech translation models combine different pre-trained acoustic models and machine translation models through two kinds of adapters. We compared the effect of standard speech features (e.g., log Mel-filterbank) and pre-trained speech features, and tried to make them interact. The final submission is an ensemble of three potential speech translation models. Our single best model and the ensemble achieve 18.66 and 19.35 BLEU, respectively, on the MuST-C En-Zh tst-COMMON set.
2021
The NiuTrans End-to-End Speech Translation System for IWSLT 2021 Offline Task
Chen Xu | Xiaoqian Liu | Xiaowen Liu | Tiger Wang | Canan Huang | Tong Xiao | Jingbo Zhu
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
This paper describes the submission of the NiuTrans end-to-end speech translation system for the IWSLT 2021 offline task, which translates from English audio to German text directly without intermediate transcription. We use a Transformer-based model architecture and enhance it with the Conformer, relative position encoding, and stacked acoustic and textual encoding. To augment the training data, the English transcriptions are translated into German. Finally, we employ ensemble decoding to integrate the predictions from several models trained on different datasets. Combining these techniques, we achieve 33.84 BLEU points on the MuST-C En-De test set, which shows the enormous potential of the end-to-end model.
2020
The NiuTrans Machine Translation Systems for WMT20
Yuhao Zhang | Ziyang Wang | Runzhe Cao | Binghao Wei | Weiqiao Shan | Shuhan Zhou | Abudurexiti Reheman | Tao Zhou | Xin Zeng | Laohu Wang | Yongyu Mu | Jingnan Zhang | Xiaoqian Liu | Xuanjun Zhou | Yinqiao Li | Bei Li | Tong Xiao | Jingbo Zhu
Proceedings of the Fifth Conference on Machine Translation
This paper describes the NiuTrans neural machine translation systems for the WMT20 news translation tasks. We participated in five tasks in total: Japanese<->English (both directions), English->Chinese, Inuktitut->English, and Tamil->English, and ranked first in both directions of Japanese<->English. We mainly utilized iterative back-translation, model architectures of different depths and widths, iterative knowledge distillation, and iterative fine-tuning. We find that when the model is adequately widened and deepened simultaneously, performance improves significantly. The iterative fine-tuning strategy we implemented is also effective for domain adaptation. For the Inuktitut->English and Tamil->English tasks, we built multilingual models separately and employed pretrained word embeddings to obtain better performance.
Pretrain-KGE: Learning Knowledge Representation from Pretrained Language Models
Zhiyuan Zhang | Xiaoqian Liu | Yi Zhang | Qi Su | Xu Sun | Bin He
Findings of the Association for Computational Linguistics: EMNLP 2020
Conventional knowledge graph embedding (KGE) often suffers from limited knowledge representation, leading to performance degradation, especially in low-resource settings. To remedy this, we propose to enrich knowledge representation by leveraging world knowledge from pretrained language models. Specifically, we present a universal training framework named Pretrain-KGE consisting of three phases: a semantic-based fine-tuning phase, a knowledge extracting phase, and a KGE training phase. Extensive experiments show that our proposed Pretrain-KGE improves results over KGE models, especially on the low-resource problem.
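As a small illustration of the KGE side of such a framework, the sketch below shows the classic TransE scoring function for a (head, relation, tail) triple. In a Pretrain-KGE-style setup, the entity and relation vectors fed to this score would be initialised from a pretrained language model's representations rather than at random; the function name is ours, and TransE is only one of the KGE models a framework like this could wrap.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance ||h + r - t||.

    Higher (closer to zero) means the triple is more plausible; a
    perfectly consistent triple has h + r == t and score 0.
    """
    return -np.linalg.norm(h + r - t)
```

A correct triple should score higher than a corrupted one with the tail replaced by a random entity, which is the usual ranking objective in KGE training.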