Jionghao Han


2025

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Jiatong Shi | Hye-jin Shim | Jinchuan Tian | Siddhant Arora | Haibin Wu | Darius Petermann | Jia Qi Yip | You Zhang | Yuxun Tang | Wangyou Zhang | Dareen Safar Alharthi | Yichen Huang | Koichi Saito | Jionghao Han | Yiwen Zhao | Chris Donahue | Shinji Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.

2024

On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Siddhant Arora | Ankita Pasad | Chung-Ming Chien | Jionghao Han | Roshan Sharma | Jee-weon Jung | Hira Dhamyal | William Chen | Suwon Shon | Hung-yi Lee | Karen Livescu | Shinji Watanabe
Findings of the Association for Computational Linguistics: ACL 2024

The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.