VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe


Abstract
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.
Anthology ID:
2025.naacl-demo.19
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Nouha Dziri, Sean (Xiang) Ren, Shizhe Diao
Venues:
NAACL | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
191–209
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-demo.19/
DOI:
Bibkey:
Cite (ACL):
Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, and Shinji Watanabe. 2025. VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), pages 191–209, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music (Shi et al., NAACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-demo.19.pdf