Chenyuhao Wen
2026
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Jingyu Lu | Yuhan Wang | Fan Zhuo | Xize Cheng | Changhao Pan | Xueyi Pu | Yifu Chen | Chenyuhao Wen | Tianle Liang | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid evolution of end-to-end spoken dialogue systems demands moving beyond textual semantics alone to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. The model operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions.
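The abstract does not state the exact training objective or how accuracy is computed. As a minimal sketch of what pairwise preference supervision and pairwise preference accuracy commonly look like, the Python below assumes a Bradley-Terry style loss over two scalar episode scores. The function names and the example scores are hypothetical and are not taken from the paper.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(score_chosen - score_rejected)).

    Assumption: the reward model emits one scalar score per dialogue episode.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen episode scores higher."""
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical scores for four preference pairs: (chosen, rejected).
example_pairs = [(1.2, 0.3), (0.1, 0.8), (2.0, 1.5), (0.9, -0.4)]
example_accuracy = pairwise_accuracy(example_pairs)
```

Under this assumed formulation, the loss shrinks as the preferred episode's score rises above the rejected one, and accuracy counts how often the model ranks a pair in the annotated order.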
Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios
Changhao Pan | Rui Yang | Han Wang | Zhuan Zhou | Xuming He | Wenxiang Guo | Ziyue Jiang | Ruiqi Li | Yu Zhang | Chenyuhao Wen | Ke Lei | Xiang Yin | Jingyu Lu | Zhiyuan Zhu | Zhou Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, and so fail to generalize reliably. To this end, we propose LFSBench, a comprehensive benchmark that decomposes “long-form speech quality” into specific, disentangled dimensions. LFSBench has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and multi-speaker dialog generation, LFSBench covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, LFSBench defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.