VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models

Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, Irwin King
Abstract
With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs’ knowledge understanding due to their inability to support end-to-end speech evaluation and account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs’ knowledge understanding through pure speech interactions. Our benchmark uniquely maintains speech format for both inputs and outputs, evaluates model robustness across diverse input audio conditions, and pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Through systematic evaluation, we demonstrate that current SLMs exhibit poor performance on VoxEval, show sensitivity to varying audio conditions, and possess limited reasoning capabilities, highlighting critical areas for future development. VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
Anthology ID:
2025.acl-long.818
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
16735–16753
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.818/
Cite (ACL):
Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, and Irwin King. 2025. VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16735–16753, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models (Cui et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.818.pdf