Beyond Transcripts: A Renewed Perspective on Audio Chaptering
Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel
Abstract
Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and evaluating without transcripts. We address these gaps through three contributions: (1) a systematic comparison of text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs (MLLMs); (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and current MLLMs struggle due to context limitations and weak instruction following.
- Anthology ID:
- 2026.acl-long.396
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8765–8787
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.396/
- Cite (ACL):
- Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, and Alexander Waibel. 2026. Beyond Transcripts: A Renewed Perspective on Audio Chaptering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8765–8787, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Beyond Transcripts: A Renewed Perspective on Audio Chaptering (Retkowski et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.396.pdf