Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen


Abstract
End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
Anthology ID:
2025.findings-emnlp.241
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4499–4507
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.241/
DOI:
10.18653/v1/2025.findings-emnlp.241
Cite (ACL):
Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, and Xie Chen. 2025. Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 4499–4507, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation (Feng et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.241.pdf
Checklist:
2025.findings-emnlp.241.checklist.pdf