SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Laya Iyer, Angelina Wang, Sanmi Koyejo


Abstract
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, little work has measured audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. In addition to performance, we also measure model latency. The benchmark suite assesses audio beyond what words are said, focusing instead on how they are said and on the non-speech components of the audio. To strengthen ecological validity, we include a small human-recorded evaluation split per category. Grounded in the needs articulated by two audio understanding use cases, accessibility technology and industrial noise monitoring, this benchmark reveals critical gaps in current LALMs. Performance varies widely across tasks, ranging from far below random chance to high accuracy. We also provide a structured error taxonomy to characterize common failure modes across tasks. These results provide direction for targeted improvements in model capabilities.
Anthology ID:
2026.eacl-long.335
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
7123–7137
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.335/
Cite (ACL):
Laya Iyer, Angelina Wang, and Sanmi Koyejo. 2026. SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7123–7137, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases (Iyer et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.335.pdf