ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper

Hirofumi Inaguma, Shun Kiyono, Nelson Enrique Yalta Soplin, Jun Suzuki, Kevin Duh, Shinji Watanabe


Abstract
This paper describes the ESPnet submissions to the How2 Speech Translation task at IWSLT 2019. This year, we mainly build our systems on Transformer architectures for all tasks and focus on end-to-end speech translation (E2E-ST). We first compare RNN-based models with the Transformer, and confirm that Transformer models significantly and consistently outperform RNN models across all tasks and corpora. Next, we investigate pre-training of E2E-ST models with the ASR and MT tasks. On top of the pre-training, we further explore knowledge distillation from an NMT model and a deeper speech encoder, and confirm substantial improvements over the baseline model. All of our code is publicly available in ESPnet.
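The knowledge distillation mentioned above trains the E2E-ST decoder to match the output distribution of a text-based NMT teacher. Below is a minimal PyTorch sketch of a word-level distillation loss of this kind; the names (st_logits, teacher_probs, kd_weight) and the simple interpolation scheme are illustrative assumptions, not ESPnet's actual implementation.

import torch.nn.functional as F

def kd_loss(st_logits, teacher_probs, targets, pad_id, kd_weight=0.5):
    # st_logits:     (batch, time, vocab) raw decoder scores of the ST student
    # teacher_probs: (batch, time, vocab) softmax outputs of the NMT teacher
    # targets:       (batch, time) reference token ids
    log_p = F.log_softmax(st_logits, dim=-1)
    # Hard-target term: cross-entropy against the reference translation.
    ce = F.nll_loss(log_p.transpose(1, 2), targets, ignore_index=pad_id)
    # Soft-target term: KL(teacher || student), masked at padding positions.
    mask = (targets != pad_id).unsqueeze(-1).float()
    kl = teacher_probs * (teacher_probs.clamp_min(1e-8).log() - log_p) * mask
    kl = kl.sum() / mask.sum()
    # Interpolate the two terms; kd_weight is a hypothetical tuning knob.
    return (1.0 - kd_weight) * ce + kd_weight * kl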
Anthology ID:
2019.iwslt-1.4
Volume:
Proceedings of the 16th International Conference on Spoken Language Translation
Month:
November 2-3
Year:
2019
Address:
Hong Kong
Venues:
EMNLP | IWSLT
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2019.iwslt-1.4
Cite (ACL):
Hirofumi Inaguma, Shun Kiyono, Nelson Enrique Yalta Soplin, Jun Suzuki, Kevin Duh, and Shinji Watanabe. 2019. ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong. Association for Computational Linguistics.
Cite (Informal):
ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper (Inaguma et al., IWSLT 2019)
PDF:
https://preview.aclanthology.org/update-css-js/2019.iwslt-1.4.pdf
Data
LibriSpeech | MuST-C