Fabian Retkowski
2026
Speech Translation and Metrics in 2026: Findings of the IWSLT Campaign
David Ifeoluwa Adelani | Victor Agostinelli | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Sébastien Bratières | Marine Carpuat | Fabrício Carraro | Roldano Cattoni | Mauro Cettolo | Lizhong Chen | Marcello Federico | Marco Gaido | Mahendra Gupta | HyoJung Han | Ali Hatami | Lewis C. Howe | Dávid Javorský | Yejin Jeon | Marek Kasztelnik | Antoine Laurent | Danni Liu | Nam Luu | Min Ma | Dominik Macháček | Marie Maltais | Evgeny Matusov | John McCrae | Chutong Meng | Chandresh Kumar Maurya | Mohammad Mohammadamini | Yasmin Moslem | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Atul Kr. Ojha | John E. Ortega | Siqi Ouyang | Sara Papi | Peter Polák | Fabian Retkowski | Stephanny Sánchez | Beatrice Savoldi | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Marie Tahon | Marco Turchi | Alexander Waibel | Patrick Wilken | Rodolfo Joel Zevallos | Vilem Zouhar | Maike Züfle
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
David Ifeoluwa Adelani | Victor Agostinelli | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Sébastien Bratières | Marine Carpuat | Fabrício Carraro | Roldano Cattoni | Mauro Cettolo | Lizhong Chen | Marcello Federico | Marco Gaido | Mahendra Gupta | HyoJung Han | Ali Hatami | Lewis C. Howe | Dávid Javorský | Yejin Jeon | Marek Kasztelnik | Antoine Laurent | Danni Liu | Nam Luu | Min Ma | Dominik Macháček | Marie Maltais | Evgeny Matusov | John McCrae | Chutong Meng | Chandresh Kumar Maurya | Mohammad Mohammadamini | Yasmin Moslem | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Atul Kr. Ojha | John E. Ortega | Siqi Ouyang | Sara Papi | Peter Polák | Fabian Retkowski | Stephanny Sánchez | Beatrice Savoldi | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Marie Tahon | Marco Turchi | Alexander Waibel | Patrick Wilken | Rodolfo Joel Zevallos | Vilem Zouhar | Maike Züfle
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
This paper reports on the outcomes of the shared tasks organized as part of the 23rd International Workshop on Spoken Language Translation (IWSLT). The workshop covered ten major challenges in spoken language translation, including speech-to-text translation for both high-resource and low-resource language pairs, customized speech translation, speech generation, instruction-following speech processing, and the evaluation of speech translation systems. The shared tasks received strong participation, with more than 30 teams submitting runs. This year’s edition broadened the range of tasks, placing particular emphasis on speech generation and evaluation metrics.
Multilingual Long-Form Speech Instruction Following: KIT’s Submission to IWSLT 2026
Enes Yavuz Ugan | Maike Züfle | Yuka Ko | Supriti Sinhamahapatra | Fabian Retkowski | Seymanur Akti | Jan Niehues | Alexander Waibel
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Enes Yavuz Ugan | Maike Züfle | Yuka Ko | Supriti Sinhamahapatra | Fabian Retkowski | Seymanur Akti | Jan Niehues | Alexander Waibel
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT’s Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT’s submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion
Sai Koneru | Fabian Retkowski | Christian Huber | Lukas Hilgert | Seymanur Akti | Enes Yavuz Ugan | Alexander Waibel | Jan Niehues
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Sai Koneru | Fabian Retkowski | Christian Huber | Lukas Hilgert | Seymanur Akti | Enes Yavuz Ugan | Alexander Waibel | Jan Niehues
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline[All released code and models are licensed under the MIT License].
Beyond Transcripts: A Renewed Perspective on Audio Chaptering
Fabian Retkowski | Maike Züfle | Thai Binh Nguyen | Jan Niehues | Alexander Waibel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fabian Retkowski | Maike Züfle | Thai Binh Nguyen | Jan Niehues | Alexander Waibel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and current MLLMs struggle due to context limitations and weak instruction following.
2025
The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?
Fabian Retkowski | Andreas Sudmann | Alexander Waibel
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Fabian Retkowski | Andreas Sudmann | Alexander Waibel
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Qualitative research often involves labor-intensive processes that are difficult to scale while preserving analytical depth. This paper introduces The AI Co-Ethnographer (AICoE), a novel end-to-end pipeline developed for qualitative research and designed to move beyond the limitations of simply automating code assignments, offering a more integrated approach. AICoE organizes the entire process, encompassing open coding, code consolidation, code application, and even pattern discovery, leading to a comprehensive analysis of qualitative data.
Zero-Shot Strategies for Length-Controllable Summarization
Fabian Retkowski | Alexander Waibel
Findings of the Association for Computational Linguistics: NAACL 2025
Fabian Retkowski | Alexander Waibel
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs’ length control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model. To address these challenges, we introduce a set of methods: length approximation, target adjustment, sample filtering, and automated revisions. By combining these methods, we demonstrate substantial improvements in length compliance while maintaining or enhancing summary quality, providing highly effective zero-shot strategies for precise length control without the need for model fine-tuning or architectural changes. With our work, we not only advance our understanding of LLM behavior in controlled text generation but also pave the way for more reliable and adaptable summarization systems in real-world applications.
Summarizing Speech: A Comprehensive Survey
Fabian Retkowski | Maike Züfle | Andreas Sudmann | Dinah Pfau | Shinji Watanabe | Jan Niehues | Alexander Waibel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Fabian Retkowski | Maike Züfle | Andreas Sudmann | Dinah Pfau | Shinji Watanabe | Jan Niehues | Alexander Waibel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization remains loosely defined. The field intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions. In doing so, we surface the ongoing challenges, such as the need for realistic evaluation benchmarks, multilingual datasets, and long-context handling.
2024
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
Fabian Retkowski | Alexander Waibel
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Fabian Retkowski | Alexander Waibel
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Text segmentation is a fundamental task in natural language processing, where documents are split into contiguous sections. However, prior research in this area has been constrained by limited datasets, which are either small in scale, synthesized, or only contain well-structured documents. In this paper, we address these limitations by introducing a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse. As part of this work, we introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines. Lastly, we expand the notion of text segmentation to a more practical “smart chaptering” task that involves the segmentation of unstructured content, the generation of meaningful segment titles, and a potential real-time application of the models.
2023
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Christian Huber | Tu Anh Dinh | Carlos Mullov | Ngoc-Quan Pham | Thai Binh Nguyen | Fabian Retkowski | Stefan Constantin | Enes Ugan | Danni Liu | Zhaolin Li | Sai Koneru | Jan Niehues | Alexander Waibel
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Christian Huber | Tu Anh Dinh | Carlos Mullov | Ngoc-Quan Pham | Thai Binh Nguyen | Fabian Retkowski | Stefan Constantin | Enes Ugan | Danni Liu | Zhaolin Li | Sai Koneru | Jan Niehues | Alexander Waibel
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The challenge of low-latency speech translation has recently draw significant interest in the research community as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows to automatically evaluate the translation quality as well as latency and also provides a web interface to show the low-latency model outputs to the user.
Search
Fix author
Co-authors
- Jan Niehues 6
- Alex Waibel 5
- Maike Züfle 4
- Seymanur Akti 2
- Christian Huber 2
- Sai Koneru 2
- Danni Liu 2
- Thai Binh Nguyen 2
- Andreas Sudmann 2
- Enes Yavuz Ugan 2
- Alexander Waibel 2
- Alexander Waibel 2
- David Ifeoluwa Adelani 1
- Victor Agostinelli 1
- Antonios Anastasopoulos 1
- Luisa Bentivogli 1
- Ondřej Bojar 1
- Sébastien Bratières 1
- Marine Carpuat 1
- Fabrício Carraro 1
- Roldano Cattoni 1
- Mauro Cettolo 1
- Lizhong Chen 1
- Stefan Constantin 1
- Tu Anh Dinh 1
- Marcello Federico 1
- Marco Gaido 1
- Mahendra Gupta 1
- HyoJung Han 1
- Ali Hatami 1
- Lukas Hilgert 1
- Lewis C. Howe 1
- Dávid Javorský 1
- Yejin Jeon 1
- Marek Kasztelnik 1
- Yuka Ko 1
- Antoine Laurent 1
- Zhaolin Li 1
- Nam Luu 1
- Min Ma 1
- Dominik Macháček 1
- Marie Maltais 1
- Evgeny Matusov 1
- Chandresh Kumar Maurya 1
- John Philip McCrae 1
- Chutong Meng 1
- Mohammad Mohammadamini 1
- Yasmin Moslem 1
- Carlos Mullov 1
- Kenton Murray 1
- Satoshi Nakamura 1
- Matteo Negri 1
- Atul Kr. Ojha 1
- John E. Ortega 1
- Siqi Ouyang 1
- Sara Papi 1
- Dinah Pfau 1
- Ngoc-Quan Pham 1
- Peter Polák 1
- Beatrice Savoldi 1
- Claytone Sikasote 1
- Supriti Sinhamahapatra 1
- Matthias Sperber 1
- Sebastian Stüker 1
- Katsuhito Sudoh 1
- Stephanny Sánchez 1
- Marie Tahon 1
- Marco Turchi 1
- Enes Ugan 1
- Shinji Watanabe 1
- Patrick Wilken 1
- Rodolfo Zevallos 1
- Vilém Zouhar 1