Yifu Chen
2025
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
Yifu Chen
|
Shengpeng Ji
|
Haoxiao Wang
|
Ziqing Wang
|
Siyu Chen
|
Jinzheng He
|
Jin Xu
|
Zhou Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG’s unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
2020
Guiding Variational Response Generator to Exploit Persona
Bowen Wu
|
MengYuan Li
|
Zongsheng Wang
|
Yifu Chen
|
Derek F. Wong
|
Qihang Feng
|
Junhong Huang
|
Baoxun Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Leveraging persona information of users in Neural Response Generators (NRG) to perform personalized conversations has been considered as an attractive and important topic in the research of conversational agents over the past few years. Despite of the promising progress achieved by recent studies in this field, persona information tends to be incorporated into neural networks in the form of user embeddings, with the expectation that the persona can be involved via End-to-End learning. This paper proposes to adopt the personality-related characteristics of human conversations into variational response generators, by designing a specific conditional variational autoencoder based deep model with two new regularization terms employed to the loss function, so as to guide the optimization towards the direction of generating both persona-aware and relevant responses. Besides, to reasonably evaluate the performances of various persona modeling approaches, this paper further presents three direct persona-oriented metrics from different perspectives. The experimental results have shown that our proposed methodology can notably improve the performance of persona-aware response generation, and the metrics are reasonable to evaluate the results.
Search
Fix author
Co-authors
- Siyu Chen (陈思瑜) 1
- Qihang Feng 1
- Jinzheng He 1
- Junhong Huang 1
- Shengpeng Ji 1
- show all...
Venues
- acl2