Haotian Tan


2025

NAIST Offline Speech Translation System for IWSLT 2025
Ruhiyah Faradishi Widiaputri | Haotian Tan | Jan Meyer Saragih | Yuka Ko | Katsuhito Sudoh | Satoshi Nakamura | Sakriani Sakti
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper presents NAIST's submission to the offline speech translation task of the IWSLT 2025 evaluation campaign, focusing on English-to-German and English-to-Chinese translation. We implemented both cascade and end-to-end frameworks using various components. For the cascade approach, we used Whisper and SALMONN as automatic speech recognition systems, each paired with the Qwen2.5 large language model (LLM) for translation. In the end-to-end setting, we used SALMONN as a speech translation model and also built a custom model combining the Whisper encoder, the DeCo projector, and the Qwen2.5 LLM. To further leverage the capabilities of LLMs, we experimented with different prompting strategies. Additionally, since long speech inputs are segmented for processing, we applied hypothesis combination techniques to generate the final translation output. Our results show that combining Whisper and LLMs can yield strong translation performance, even without further fine-tuning in the cascade setup. Moreover, our proposed end-to-end architecture achieved competitive results despite being trained on significantly less data than SALMONN. Finally, for both language pairs, our IWSLT 2025 submission used both SALMONN as an end-to-end speech translation model and our proposed end-to-end model.
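To make the cascade setup concrete, here is a minimal sketch of the ASR-then-LLM pipeline, assuming the openai-whisper package and the Hugging Face Qwen/Qwen2.5-7B-Instruct checkpoint. The prompt wording, model sizes, and segment handling are illustrative assumptions, not the exact configuration from the paper.

```python
# Minimal cascade sketch: Whisper ASR followed by Qwen2.5 translation.
# Assumes `pip install openai-whisper transformers torch`; prompt and
# checkpoints are illustrative, not the paper's exact configuration.
import whisper
from transformers import AutoModelForCausalLM, AutoTokenizer

asr = whisper.load_model("large-v3")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
llm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", device_map="auto"
)

def translate_segment(wav_path: str, target_lang: str = "German") -> str:
    # Step 1: transcribe the (pre-segmented) English audio.
    transcript = asr.transcribe(wav_path, language="en")["text"].strip()
    # Step 2: prompt the LLM to translate the transcript.
    messages = [
        {"role": "system", "content": "You are a professional translator."},
        {"role": "user", "content": f"Translate into {target_lang}:\n{transcript}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)
    output = llm.generate(inputs, max_new_tokens=256)
    return tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True)
```

In this setup the LLM is used zero-shot, which matches the abstract's observation that the cascade works well even without further fine-tuning.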

NAIST Simultaneous Speech Translation System for IWSLT 2025
Haotian Tan | Ruhiyah Faradishi Widiaputri | Jan Meyer Saragih | Yuka Ko | Katsuhito Sudoh | Satoshi Nakamura | Sakriani Sakti
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper describes the NAIST submission to the English-to-{German, Japanese, Chinese} Simultaneous Speech-to-Text track at IWSLT 2025. Last year, our system was based on an end-to-end speech-to-text translation model that combined HuBERT and mBART. This year, the system consists of a Whisper encoder, the DeCo compressive projector, and the Qwen large language model. The simultaneous translation (SimulST) system is implemented by applying a local agreement policy to an offline-trained translation model. For the streaming translation (StreamST) system, we integrate an online version of the SHAS segmenter into our SimulST architecture. Our results demonstrate that adopting LLMs as the backbone architecture for speech translation tasks yields strong translation performance. Additionally, leveraging the robust segmentation capability of SHAS for StreamST achieves a good quality-latency trade-off when processing unbounded audio streams.
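As a concrete illustration of how a local agreement policy turns an offline model into a simultaneous one, here is a minimal sketch. The `translate` function is a hypothetical stand-in for the offline Whisper-DeCo-Qwen model; the policy itself simply commits the longest common prefix of the hypotheses produced for two consecutive input prefixes.

```python
# Minimal sketch of a local agreement (LA) policy over an offline ST model.
# `translate(audio)` is a hypothetical placeholder: it maps an audio prefix
# to a list of target-language tokens.
from typing import Callable, List

def local_agreement(
    chunks: List[bytes],
    translate: Callable[[bytes], List[str]],
) -> List[str]:
    committed: List[str] = []   # tokens already emitted (irrevocable)
    prev_hyp: List[str] = []    # hypothesis from the previous chunk
    audio = b""
    for chunk in chunks:
        audio += chunk                  # grow the input prefix
        hyp = translate(audio)          # re-translate the whole prefix
        # Commit the longest common prefix of the two latest hypotheses,
        # beyond what has already been committed.
        agree = 0
        while (agree < len(prev_hyp) and agree < len(hyp)
               and prev_hyp[agree] == hyp[agree]):
            agree += 1
        committed.extend(hyp[len(committed):agree])
        prev_hyp = hyp
    committed.extend(prev_hyp[len(committed):])  # flush at end of stream
    return committed
```

The appeal of this policy is that the underlying model never needs simultaneous training: latency is traded for quality purely by the chunk size and the agreement rule.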

2024

NAIST Simultaneous Speech Translation System for IWSLT 2024
Yuka Ko | Ryo Fukuda | Yuta Nishikawa | Yasumasa Kano | Tomoya Yanagita | Kosuke Doi | Mana Makinae | Haotian Tan | Makoto Sakai | Sakriani Sakti | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We developed a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy to the estimation model. The results show that the upgraded TTS module contributed to improving overall system performance.
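For reference, the AlignAtt policy mentioned above decides when to emit a token from the decoder's cross-attention. The sketch below, using a hypothetical attention vector, shows the core rule as described by Papi et al. (2023): a candidate token is emitted only if its attention peak does not fall on the most recent audio frames.

```python
# Minimal sketch of the AlignAtt emission rule (Papi et al., 2023).
# `attn` is a hypothetical cross-attention vector of the newest candidate
# token over the encoder frames seen so far.
from typing import Sequence

def alignatt_should_emit(attn: Sequence[float], frame_threshold: int = 4) -> bool:
    """Emit the candidate token only if its attention peak lies outside
    the last `frame_threshold` frames of the partial input; otherwise
    wait for more audio before committing."""
    n_frames = len(attn)
    peak = max(range(n_frames), key=lambda i: attn[i])
    return peak < n_frames - frame_threshold

# Attention concentrated on early frames -> safe to emit.
print(alignatt_should_emit([0.6, 0.2, 0.1, 0.05, 0.03, 0.02]))  # True
# Attention on the newest frames -> wait for more input.
print(alignatt_should_emit([0.02, 0.03, 0.05, 0.1, 0.2, 0.6]))  # False
```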