UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

Qundong Shi, Jie Zhou, Biyuan Lin, Junbo Cui, Guoyang Zeng, Yixuan Zhou, Ziyang Wang, Xin Liu, Zhen Luo, Yudong Wang, Zhiyuan Liu


Abstract
The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources; (2) audio codecs, a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks rely heavily on English, making it challenging to objectively assess models’ performance on Chinese. We introduce UltraEval-Audio, a unified framework that addresses these challenges through a modular architecture supporting 10 languages, 14 task categories, 24 models, and 36 benchmarks, with one-command evaluation and real-time leaderboards. For audio codecs, we propose a three-dimensional evaluation scheme covering semantic accuracy, timbre fidelity, and acoustic quality. For Chinese evaluation, we introduce two new benchmarks: SpeechCMMLU and SpeechHSK. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.
Anthology ID:
2026.acl-demo.56
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Greg Durrett, Ping Jian
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
566–577
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-demo.56/
Cite (ACL):
Qundong Shi, Jie Zhou, Biyuan Lin, Junbo Cui, Guoyang Zeng, Yixuan Zhou, Ziyang Wang, Xin Liu, Zhen Luo, Yudong Wang, and Zhiyuan Liu. 2026. UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 566–577, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models (Shi et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-demo.56.pdf