Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel


Abstract
Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison of text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs (MLLMs); (2) an empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, that pauses provide the largest acoustic gains, and that current MLLMs struggle due to context limitations and weak instruction following.
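To make the idea of a transcript-invariant, time-space protocol more concrete, the following is a minimal sketch of one way such scoring could work: predicted and reference chapter boundaries are compared directly as timestamps in seconds, so no transcript is needed. This is an illustrative assumption, not the protocol defined in the paper. The function name `boundary_f1`, the greedy one-to-one matching, and the 5-second tolerance are all hypothetical choices.

```python
from typing import List, Tuple


def boundary_f1(pred: List[float], ref: List[float],
                tol: float = 5.0) -> Tuple[float, float, float]:
    """Score predicted boundary times against reference times (seconds).

    Hypothetical illustration: a prediction counts as a hit if it lies
    within `tol` seconds of a reference boundary that has not been
    matched yet. Each reference boundary can be matched at most once.
    """
    pred_sorted = sorted(pred)
    ref_sorted = sorted(ref)
    used = [False] * len(ref_sorted)
    matched = 0

    for p in pred_sorted:
        best_idx = None
        best_dist = tol
        # Find the closest unmatched reference boundary within tolerance.
        for i, r in enumerate(ref_sorted):
            if used[i]:
                continue
            dist = abs(p - r)
            if dist <= best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None:
            used[best_idx] = True
            matched += 1

    precision = matched / len(pred_sorted) if pred_sorted else 0.0
    recall = matched / len(ref_sorted) if ref_sorted else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Two of three predictions fall within 5 s of a reference boundary.
    print(boundary_f1([10.0, 62.0, 200.0], [12.0, 60.0, 120.0]))
```

Because the inputs are plain timestamps, the same score applies whether the boundaries come from a text-based model run on an ASR transcript or from an audio-only model, which is the property a time-space protocol is meant to provide.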
Anthology ID:
2026.acl-long.396
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8765–8787
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.396/
Cite (ACL):
Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, and Alexander Waibel. 2026. Beyond Transcripts: A Renewed Perspective on Audio Chaptering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8765–8787, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Beyond Transcripts: A Renewed Perspective on Audio Chaptering (Retkowski et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.396.pdf
Checklist:
 2026.acl-long.396.checklist.pdf