VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li


Abstract
Recent advancements in large language models (LLMs) like GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering an improved user experience over text-based interactions. However, a suitable benchmark to rigorously evaluate such speech interaction systems is currently lacking. To bridge this gap, we introduce VoiceBench, the first benchmark specifically designed to assess LLM-based voice assistants. VoiceBench comprises 6,783 synthetic and real spoken instructions recorded from diverse speakers across eight distinct tasks. These instructions are meticulously crafted to assess three crucial capability areas: general knowledge, instruction-following, and safety compliance. Furthermore, VoiceBench systematically incorporates realistic variations common in spoken interactions, including differences in speaker characteristics (e.g., accents), heterogeneous environmental conditions (e.g., reverberation), and content complexities such as mispronunciations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
Anthology ID:
2026.tacl-1.18
Volume:
Transactions of the Association for Computational Linguistics, Volume 14
Year:
2026
Address:
Cambridge, MA
Venue:
TACL
Publisher:
MIT Press
Pages:
378–398
URL:
https://preview.aclanthology.org/ingest-latest-mitpress-cl-tacl/2026.tacl-1.18/
DOI:
10.1162/tacl.a.628
Cite (ACL):
Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. 2026. VoiceBench: Benchmarking LLM-Based Voice Assistants. Transactions of the Association for Computational Linguistics, 14:378–398.
Cite (Informal):
VoiceBench: Benchmarking LLM-Based Voice Assistants (Chen et al., TACL 2026)
PDF:
https://preview.aclanthology.org/ingest-latest-mitpress-cl-tacl/2026.tacl-1.18.pdf