Test-Time Adaptation of an Offline Multimodal Foundation Model for Simultaneous Speech Translation

Yi Xing; Manli Yu; Pengfei Liu; Helen Meng

doi:10.18653/v1/2026.iwslt-1.27

Abstract

End-to-end simultaneous speech-to-text translation (SimulST) systems typically rely on complex architectures and sophisticated training strategies. In contrast, we propose a simple approach that combines conventional pause-based segmentation of the streaming audio input with a strong off-the-shelf multimodal foundation model adapted at test time for translation. To achieve simultaneity, we adopt a variant of the classic wait-k read-write policy to control the interaction between audio input and translation output, and we use a multi-turn conversation format with response prefilling and key-value caching for coherent translation and computational efficiency. Experiments on the official development sets of the IWSLT 2026 SimulST shared task show that our system achieves a better quality–latency trade-off than the cascaded baseline across all language directions and latency regimes, highlighting the effectiveness of this simple yet powerful approach.
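
For readers unfamiliar with the read-write policy mentioned in the abstract, the following is a minimal sketch of the classic wait-k schedule, not the variant used in the paper, whose details the abstract does not give. The function name `wait_k_schedule` and the notion of abstract source and target "units" (for example, audio segments and output chunks) are assumptions made for illustration only.

```python
from typing import List


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of the classic wait-k policy.

    Hypothetical helper for illustration: the policy first reads k source
    units, then alternates one WRITE with one READ, and writes the remaining
    target units once the source is exhausted.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read, written = 0, 0
    while written < num_target:
        # Keep reading until the input is k units ahead of the output,
        # or until there is no more source to read.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


if __name__ == "__main__":
    # Four source units, four target units, k = 2.
    print(" ".join(wait_k_schedule(4, 4, 2)))
```

A larger k delays the first WRITE and so trades latency for more source context, which is the quality–latency trade-off the abstract refers to.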

Anthology ID: 2026.iwslt-1.27
Volume: Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Month: July
Year: 2026
Address: San Diego, USA (in-person and online)
Editors: Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
Venues: IWSLT | WS
SIG: SIGSLT
Publisher: Association for Computational Linguistics
Pages: 238–246
URL: https://preview.aclanthology.org/bulk-corrections-2026-07-02/2026.iwslt-1.27/
DOI: 10.18653/v1/2026.iwslt-1.27
Cite (ACL): Yi Xing, Manli Yu, Pengfei Liu, and Helen Meng. 2026. Test-Time Adaptation of an Offline Multimodal Foundation Model for Simultaneous Speech Translation. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 238–246, San Diego, USA (in-person and online). Association for Computational Linguistics.
Cite (Informal): Test-Time Adaptation of an Offline Multimodal Foundation Model for Simultaneous Speech Translation (Xing et al., IWSLT 2026)
PDF: https://preview.aclanthology.org/bulk-corrections-2026-07-02/2026.iwslt-1.27.pdf
