Haibin Wu
2026
Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
Minseok Kim | Jingxiang Chen | Seong-Gyun Leem | Yin Huang | Rashi Rungta | Zhicheng Ouyang | Haibin Wu | Surya Teja Appini | Ankur Bansal | Yang Bai | Yue Liu | Florian Metze | Ahmed A Aly | Anuj Kumar | Ariya Rastrow | Zhaojiang Lin
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for understanding intent. However, leveraging these cues is challenging because of limited training data, the difficulty of annotation, and models' tendency to exploit lexical shortcuts instead of paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistic understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
2025
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Jiatong Shi | Hye-jin Shim | Jinchuan Tian | Siddhant Arora | Haibin Wu | Darius Petermann | Jia Qi Yip | You Zhang | Yuxun Tang | Wangyou Zhang | Dareen Safar Alharthi | Yichen Huang | Koichi Saito | Jionghao Han | Yiwen Zhao | Chris Donahue | Shinji Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics support evaluations that draw on diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile enough to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.
2024
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
Haibin Wu | Ho-Lam Chung | Yi-Cheng Lin | Yuan-Kuei Wu | Xuanjun Chen | Yu-Chi Pai | Hsiu-Hsuan Wang | Kai-Wei Chang | Alexander Liu | Hung-yi Lee
Findings of the Association for Computational Linguistics: ACL 2024
The sound codec's dual roles in minimizing data transmission latency and serving as a tokenizer underscore its critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speaker, and audio information. However, the question of which codec best preserves sound information remains unanswered, because different papers evaluate their models under their own selected experimental settings. This study introduces Codec-SUPERB, an acronym for Codec sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database and thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers that mainly concentrate on signal-level comparisons. Finally, we will release code, the leaderboard, and data to accelerate progress within the community.
Co-authors
- Dareen Safar Alharthi 1
- Ahmed A Aly 1
- Surya Teja Appini 1
- Siddhant Arora 1
- Yang Bai 1
- Ankur Bansal 1
- Kai-Wei Chang 1
- Xuanjun Chen 1
- Jingxiang Chen 1
- Ho-Lam Chung 1
- Chris Donahue 1
- Jionghao Han 1
- Yichen Huang 1
- Yin Huang 1
- Minseok Kim 1
- Anuj Kumar 1
- Hung-yi Lee 1
- Seong-Gyun Leem 1
- Yi-Cheng Lin 1
- Zhaojiang Lin 1
- Alex Liu 1
- Yue Liu 1
- Florian Metze 1
- Zhicheng Ouyang 1
- Yu-Chi Pai 1
- Darius Petermann 1
- Ariya Rastrow 1
- Rashi Rungta 1
- Koichi Saito 1
- Jiatong Shi 1
- Hye-jin Shim 1
- Yuxun Tang 1
- Jinchuan Tian 1
- Hsiu-Hsuan Wang 1
- Shinji Watanabe 1
- Yuan-Kuei Wu 1
- Jia Qi Yip 1
- You Zhang 1
- Wangyou Zhang 1
- Yiwen Zhao 1