Tianle Liang
2026
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Jingyu Lu | Yuhan Wang | Fan Zhuo | Xize Cheng | Changhao Pan | Xueyi Pu | Yifu Chen | Chenyuhao Wen | Tianle Liang | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions.
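The pairwise preference supervision and pairwise preference accuracy mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a Bradley-Terry style objective, which is a common choice for reward models but is not confirmed by the abstract; the function names and the scalar scores are hypothetical stand-ins for the reward model's outputs on the preferred and rejected episodes.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(s_chosen - s_rejected)).

    Assumption: the abstract only says "pairwise preference supervision";
    this specific loss is an illustrative choice, not the paper's stated one.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(pairs) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for four preference pairs.
    scores = [(1.0, 0.0), (0.2, 0.5), (2.0, 1.0), (0.3, 0.1)]
    print(pairwise_accuracy(scores))
    print(pairwise_preference_loss(1.0, 0.0))
```

The loss shrinks as the preferred episode's score exceeds the rejected one's by a larger margin, while accuracy only checks the ordering of each pair.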
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
Yifu Chen | Shengpeng Ji | Zhengqing Liu | Qian Chen | Wen Wang | Ziqing Wang | Yangzhuo Li | Tianle Liang | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, and well-designed reward signals are crucial to RL performance. We consider RL a promising strategy for addressing this challenge in SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, and therefore fail to provide reliable reward signals for RL. Human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this barrier by proposing a Dual-Axis Generative Reward Model. Trained with a detailed taxonomy and an annotated dataset to understand complex interaction dynamics, it produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. These dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.
Dual-Reasoner: Bridging Interleaved Atomicity and Streaming Latency via Thinking-while-Talking
Yangzhuo Li | Shengpeng Ji | Yifu Chen | Tianle Liang | Haoyu Yang | Junboli | Jun Fang | Lin Li | Qingyang Hong
Findings of the Association for Computational Linguistics: ACL 2026
Integrating explicit Chain-of-Thought (CoT) into end-to-end spoken dialogue models enhances intelligence but incurs prohibitive latency. While the "Thinking-while-Talking" paradigm alleviates this delay, it fundamentally compromises block atomicity, severing the logical connection between interleaved thought and speech. To address this, we present Dual-Reasoner, which employs a Streaming Masking Mechanism underpinned by our Dual-Think-30k dataset to guarantee uninterrupted audio streaming. Crucially, to strictly align the fragmented thinking blocks in service of speech generation, we introduce the Atomic-Consistency Restoration framework. To secure comprehensive capabilities in high-difficulty reasoning, this framework uses a quadruple-constraint system to reconstruct logical atomicity, ensuring that "think" chunks act as a rigorous anchor for "talk" outputs. Experimental results demonstrate that Dual-Reasoner achieves comprehensive reasoning enhancements within ultra-low latency constraints: it raises the VoiceBench score from 67.24 to 73.41 over the baseline, while significantly reducing the Time-to-First-Audio (TTFA) from 20.35s to 3.65s and the Real-Time Factor (RTF) from 7.04 to 1.05.
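To put the latency figures reported in the abstract above in relative terms, the following sketch computes the fractional reduction for TTFA and RTF. It assumes the conventional definition of Real-Time Factor (processing time divided by the duration of the generated audio), which the abstract does not spell out; the helper names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the conventional definition (an assumption here):
    time spent generating divided by duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    if before <= 0:
        raise ValueError("baseline value must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # Figures quoted in the abstract.
    ttfa_cut = relative_reduction(20.35, 3.65)
    rtf_cut = relative_reduction(7.04, 1.05)
    print(f"TTFA reduced by {ttfa_cut:.1%}")
    print(f"RTF reduced by {rtf_cut:.1%}")
    print(f"VoiceBench gain: {73.41 - 67.24:.2f} points")
```

Under this definition, an RTF near 1 means audio is generated at roughly the rate it is played back, so the reported 1.05 sits just above that threshold.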
WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Yifu Chen | Shengpeng Ji | Qian Chen | Tianle Liang | Yangzhuo Li | Ziqing Wang | Wen Wang | Jingyu Lu | Haoxiao Wang | Xueyi Pu | Fan Zhuo | Zhou Zhao
Findings of the Association for Computational Linguistics: ACL 2026
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze the obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on this analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
VoxMind: An End-to-End Agentic Spoken Dialogue System
Tianle Liang | Yifu Chen | Shengpeng Ji | Yijun Chen | Zhiyang Jia | Jingyu Lu | Fan Zhuo | Xueyi Pu | Yangzhuo Li | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real-world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool-augmented extensions. To bridge this gap, we present VoxMind, an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Leveraging our curated 470-hour AgentChat dataset, we incorporate a "Think-before-Speak" mechanism, enabling the model to internalize structured reasoning as a critical prerequisite for planning and response generation. Furthermore, to mitigate latency bottlenecks caused by large-scale tool integration, we propose a Multi-Agent Dynamic Tool Management architecture. By asynchronously delegating retrieval tasks to an auxiliary agent aligned with the main model’s reasoning trajectory, this system effectively decouples inference latency from toolset size. Experimental results confirm that VoxMind achieves significant improvements in agent performance: compared with strong baselines, the task completion rate increases from 34.88% to 74.57%, outperforming Gemini-2.5-Pro on spoken agent tasks while preserving general conversational quality. The source code and associated data are publicly available at https://github.com/MM-Speech/VoxMind.
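The idea of asynchronously delegating tool retrieval to an auxiliary agent, as described in the abstract above, can be sketched with Python's asyncio. This is only an illustration of the general pattern, not VoxMind's implementation: the tool registry, the keyword-overlap ranking, and all function names are hypothetical, and the sleeps stand in for real model and retrieval latency.

```python
import asyncio

# Hypothetical toolset: tool name -> short description.
TOOLS = {
    "weather": "look up the weather forecast for a city",
    "calendar": "create or list calendar events",
    "search": "search the web for a query",
}


async def retrieve_tools(query: str, top_k: int = 2) -> list[str]:
    """Auxiliary agent (hypothetical): rank tools by keyword overlap."""
    await asyncio.sleep(0)  # stand-in for retrieval latency
    words = set(query.lower().split())
    ranked = sorted(
        TOOLS,
        key=lambda name: -len(words & set(TOOLS[name].split())),
    )
    return ranked[:top_k]


async def reason(query: str) -> str:
    """Main model (hypothetical): produce a plan while retrieval runs."""
    await asyncio.sleep(0)  # stand-in for reasoning latency
    return f"plan for: {query}"


async def respond(query: str) -> tuple[str, list[str]]:
    # Start retrieval in the background so it overlaps with reasoning.
    retrieval = asyncio.create_task(retrieve_tools(query))
    plan = await reason(query)
    tools = await retrieval
    return plan, tools


if __name__ == "__main__":
    print(asyncio.run(respond("what is the weather forecast for Paris")))
```

Because retrieval is started as a background task before reasoning is awaited, the two overlap, so the main path waits only for whichever finishes last rather than for their sum.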