A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Yulin Xue; Siqi Ouyang; Lei Li

doi:10.18653/v1/2026.iwslt-1.3

Abstract

Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.
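The final aggregation step can be illustrated with a minimal sketch. It assumes the earlier stages have already produced token-level emission times (from ASR and forced alignment) and a source-sentence index for each target token (from the aligner). The names `Token`, `SourceSentence`, `sentence_delays`, and `system_delay` are hypothetical, and the delay used here (mean lag of a sentence's target tokens behind the end of its source sentence) is a simplified stand-in, not the YAAL metric named in the abstract.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Token:
    text: str
    emit_time: float  # seconds; assumed to come from forced alignment of the target speech
    sent_id: int      # assumed output of the aligner: index of the matched source sentence


@dataclass
class SourceSentence:
    start: float  # seconds into the source audio
    end: float


def sentence_delays(tokens, sources):
    """Mean lag of each sentence's target tokens behind the end of its source sentence."""
    lags = {}
    for tok in tokens:
        lag = tok.emit_time - sources[tok.sent_id].end
        lags.setdefault(tok.sent_id, []).append(lag)
    return {sid: mean(vals) for sid, vals in sorted(lags.items())}


def system_delay(tokens, sources):
    """Aggregate sentence-level delays into one score (unweighted mean, an assumption)."""
    return mean(list(sentence_delays(tokens, sources).values()))


# Toy input: two source sentences and four aligned target tokens.
sources = [SourceSentence(0.0, 2.0), SourceSentence(2.0, 5.0)]
tokens = [
    Token("a", 3.0, 0),
    Token("b", 4.0, 0),
    Token("c", 8.0, 1),
    Token("d", 10.0, 1),
]

print(sentence_delays(tokens, sources))  # {0: 1.5, 1: 4.0}
print(system_delay(tokens, sources))     # 2.75
```

In this toy input the second sentence has a larger delay than the first, which is the kind of per-sentence trend a sentence-level breakdown makes visible on long speech.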

Anthology ID: 2026.iwslt-1.3
Volume: Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Month: July
Year: 2026
Address: San Diego, USA (in-person and online)
Editors: Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
Venues: IWSLT | WS
SIG: SIGSLT
Publisher: Association for Computational Linguistics
Pages: 32–39
URL: https://preview.aclanthology.org/bulk-corrections-2026-07-02/2026.iwslt-1.3/
DOI: 10.18653/v1/2026.iwslt-1.3
Cite (ACL): Yulin Xue, Siqi Ouyang, and Lei Li. 2026. A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 32–39, San Diego, USA (in-person and online). Association for Computational Linguistics.
Cite (Informal): A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation (Xue et al., IWSLT 2026)
PDF: https://preview.aclanthology.org/bulk-corrections-2026-07-02/2026.iwslt-1.3.pdf
