2024
NAIST Simultaneous Speech Translation System for IWSLT 2024
Yuka Ko | Ryo Fukuda | Yuta Nishikawa | Yasumasa Kano | Tomoya Yanagita | Kosuke Doi | Mana Makinae | Haotian Tan | Makoto Sakai | Sakriani Sakti | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
This paper describes NAIST’s submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-German, -Japanese, and -Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation
Kosuke Doi | Yuka Ko | Mana Makinae | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
This paper analyzes the features of monotonic translations, which follow the word order of the source language, in simultaneous interpreting (SI). Word order differences are one of the biggest challenges in SI, especially for language pairs with significant structural differences like English and Japanese. We analyzed the characteristics of chunk-wise monotonic translation (CMT) sentences using the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset and identified some grammatical structures that make monotonic translation difficult in English-Japanese SI. We further investigated the features of CMT sentences by evaluating the output from the existing speech translation (ST) and simultaneous speech translation (simulST) models on the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset as well as on existing test sets. The results indicate the possibility that the existing SI-based test set underestimates the model performance. The results also suggest that using CMT sentences as references gives higher scores to simulST models than ST models, and that using an offline-based test set to evaluate the simulST models underestimates the model performance.
2023
NAIST Simultaneous Speech-to-speech Translation System for IWSLT 2023
Ryo Fukuda | Yuta Nishikawa | Yasumasa Kano | Yuka Ko | Tomoya Yanagita | Kosuke Doi | Mana Makinae | Sakriani Sakti | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes NAIST’s submission to the IWSLT 2023 Simultaneous Speech Translation task: English-to-German, -Japanese, and -Chinese speech-to-text translation and English-to-Japanese speech-to-speech translation. Our speech-to-text system uses an end-to-end multilingual speech translation model based on large-scale pre-trained speech and text models. We add Inter-connections into the model to incorporate the outputs from intermediate layers of the pre-trained speech model, and we augment prefix-to-prefix text data using Bilingual Prefix Alignment to enhance the simultaneity of the offline speech translation model. Our speech-to-speech system employs an incremental text-to-speech module that consists of a Japanese pronunciation estimation model, an acoustic model, and a neural vocoder.