Haoxiao Wang
2026
WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Yifu Chen | Shengpeng Ji | Qian Chen | Tianle Liang | Yangzhuo Li | Ziqing Wang | Wen Wang | Jingyu Lu | Haoxiao Wang | Xueyi Pu | Fan Zhuo | Zhou Zhao
Findings of the Association for Computational Linguistics: ACL 2026
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on this analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
2025
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
Yifu Chen | Shengpeng Ji | Haoxiao Wang | Ziqing Wang | Siyu Chen | Jinzheng He | Jin Xu | Zhou Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition (ASR) to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG’s unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.