Viet Anh Khoa Tran


Does Joint Training Really Help Cascaded Speech Translation?
Viet Anh Khoa Tran | David Thulke | Yingbo Gao | Christian Herold | Hermann Ney
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Currently, in speech translation, the straightforward approach - cascading a recognition system with a translation system - delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, researchers have recently turned their attention to direct data and proposed various joint training methods. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion that marginalizes over the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained using joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher on the current speech translation landscape and motivate research into more efficient and creative ways to utilize direct data for speech translation.
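
As a rough illustration of what marginalizing over transcriptions can mean, the generic form below contrasts a cascade that commits to a single best transcription with a criterion that sums over transcription hypotheses. The symbols (x for the source speech, f for a transcription, e for the translation, N(x) for an n-best list) are assumptions chosen for this sketch; the abstract does not state the exact criterion used in the paper.

```latex
% Cascade: translate only the single best transcription.
\hat{f} = \operatorname*{argmax}_{f} \; p(f \mid x), \qquad
\hat{e} = \operatorname*{argmax}_{e} \; p(e \mid \hat{f})

% Marginalization: sum over transcription hypotheses, weighting each
% translation probability by the transcription posterior.
p(e \mid x) = \sum_{f} p(f \mid x) \, p(e \mid f)
\;\approx\; \sum_{f \in \mathcal{N}(x)} p(f \mid x) \, p(e \mid f)
```

Because the sum depends on both the recognition and the translation model, training on it with direct (speech, translation) pairs updates both components, which is one way such data can be used in a cascade.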


Analysis of Positional Encodings for Neural Machine Translation
Jan Rosendahl | Viet Anh Khoa Tran | Weiyue Wang | Hermann Ney
Proceedings of the 16th International Conference on Spoken Language Translation

In this work, we analyze and compare the behavior of the Transformer architecture when using different positional encoding methods. While absolute and relative positional encoding perform equally well overall, we show that relative positional encoding is vastly superior (4.4% to 11.9% BLEU) when translating a sentence that is longer than any observed training sentence. We further propose and analyze variations of relative positional encoding and observe that the number of trainable parameters can be reduced without a performance loss, either by using fixed encoding vectors or by removing some of the positional encoding vectors.
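
One intuition for why relative positional encoding can cope with sentences longer than any seen in training is that a common formulation indexes its embedding table by clipped relative offsets, so the set of indices does not grow with sentence length. The following is a minimal sketch of that indexing only, assuming a clipping scheme with a maximum distance; the function name `relative_position_ids` and the parameter `max_distance` are hypothetical and are not taken from the paper.

```python
def relative_position_ids(length, max_distance):
    """Build a length x length matrix of relative-position table indices.

    Entry [i][j] encodes the offset (j - i), clipped to
    [-max_distance, max_distance] and shifted by max_distance so that
    every index lies in [0, 2 * max_distance].
    """
    ids = []
    for i in range(length):
        row = []
        for j in range(length):
            offset = j - i
            # Offsets beyond the clipping window share one index.
            clipped = max(-max_distance, min(max_distance, offset))
            row.append(clipped + max_distance)
        ids.append(row)
    return ids


if __name__ == "__main__":
    for row in relative_position_ids(4, 2):
        print(row)
```

With this scheme the embedding table has 2 * max_distance + 1 entries regardless of sentence length, whereas a learned absolute encoding needs one entry per position and has no trained entry for positions beyond the longest training sentence.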