Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
Minseok Kim, Jingxiang Chen, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze, Ahmed A. Aly, Anuj Kumar, Ariya Rastrow, Zhaojiang Lin
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track), 2026
Speech large language models (LLMs) can observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for understanding intent. Leveraging these cues, however, is challenging: training data are limited, annotation is difficult, and models tend to exploit lexical shortcuts instead of paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments show that our approach improves paralinguistic understanding by 8-12% on Expresso, IEMOCAP, and RAVDESS over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio). These results indicate that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
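The abstract only sketches the joint objective, so the following Python snippet is a minimal illustration of the general idea, not the authors' implementation: one scalar reward per rollout that blends a sentiment-classification term with a response-quality term, as a multi-task RL trainer might consume. All names (Rollout, multitask_reward, the weight alpha, the judge-style response_score) are hypothetical assumptions.

# Hypothetical sketch of a multi-task RL reward in the spirit of the
# paper's joint objective; names, weights, and scoring functions are
# illustrative assumptions, not the authors' method.
from dataclasses import dataclass

@dataclass
class Rollout:
    predicted_sentiment: str   # label parsed from the model's chain-of-thought output
    gold_sentiment: str        # annotated sentiment for the audio clip
    response_score: float      # e.g., a judge score in [0, 1] for the generated reply

def multitask_reward(rollout: Rollout, alpha: float = 0.5) -> float:
    """Blend a sentiment-classification reward with a response-quality
    reward; alpha trades off understanding against generation."""
    cls_reward = 1.0 if rollout.predicted_sentiment == rollout.gold_sentiment else 0.0
    return alpha * cls_reward + (1.0 - alpha) * rollout.response_score

# Example: a rollout that predicted the sentiment correctly and drew
# a 0.8 response-quality score.
r = Rollout(predicted_sentiment="angry", gold_sentiment="angry", response_score=0.8)
print(multitask_reward(r))  # 0.5 * 1.0 + 0.5 * 0.8 = 0.9

A single blended scalar like this is one common way to feed two task signals into a policy-gradient update; the paper's actual reward design, weighting, and two-stage schedule are not specified in the abstract.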