Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Pengchao Feng | Ziyang Ma | Wenxi Chen | Yao Li | Sheng Wang | Kai Yu | Xie Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although overall performance still lags behind that of state-of-the-art cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are publicly released.
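To make the retrieval step concrete, the sketch below illustrates one way such speech-to-text retrieval could be set up: a speech-query embedding is matched against text-passage embeddings in a shared space, and the top-scoring passages are then handed to the dialogue model as context. This is only a minimal sketch under assumed interfaces; all names, shapes, and the cosine-similarity retrieval scheme are illustrative placeholders, not the paper's actual implementation.

```python
# Illustrative sketch of speech-to-text retrieval for RAG.
# All components here (the encoders producing the embeddings, ToyRetriever)
# are hypothetical stand-ins, not the paper's released code.
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class ToyRetriever:
    """Retrieve text passages whose embeddings are close to a speech-query embedding.

    Assumes a speech encoder and a text encoder trained to share one embedding
    space, so retrieval works directly from audio without an ASR transcript.
    """

    def __init__(self, passages, passage_embeddings):
        self.passages = passages
        self.index = normalize(np.asarray(passage_embeddings, dtype=np.float32))

    def retrieve(self, speech_embedding, top_k: int = 3):
        query = normalize(np.asarray(speech_embedding, dtype=np.float32).reshape(1, -1))
        scores = (self.index @ query.T).ravel()   # cosine similarity per passage
        best = np.argsort(-scores)[:top_k]
        return [self.passages[i] for i in best]


# Usage with random stand-in embeddings (a real system would use learned encoders).
rng = np.random.default_rng(0)
passages = ["Passage about topic A.", "Passage about topic B.", "Passage about topic C."]
retriever = ToyRetriever(passages, rng.normal(size=(3, 256)))
speech_emb = rng.normal(size=256)                 # would come from the speech encoder
context = retriever.retrieve(speech_emb, top_k=2)
# The retrieved text would then be injected as context into the S2S dialogue model.
print(context)
```

The key design point this sketch highlights is that the query embedding is computed from the speech signal itself, so the text knowledge base can be searched without first transcribing the query, which is what allows the retrieval to remain end-to-end.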