Laya Iyer


2026

Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess the audio beyond just what words are said— rather, in how they are said and the non-speech components of the audio. To strengthen ecological validity, we include a small human-recorded evaluation split per category. Based on the needs articulated by audio understanding use-cases of accessibility technology and industrial noise monitoring, this benchmark reveals critical gaps in current LALMs. The performance in each task is quite varied, with some tasks having performance far below random chance and others with high accuracy. We also provide a structured error taxonomy to characterize standard failure modes across tasks. These results provide direction for targeted improvements in model capabilities.
The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the token level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent. As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations.We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., "mom," "mommy," "mother" MOTHER). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests concept-level supervision as an improved training signal that better aligns LLMs with human semantic abstractions.