Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback

Siddhant Arora; Jinchuan Tian; Jiatong Shi; Hayato Futami; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe

Abstract

Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference sampling and aggregate per-block log-probabilities within a single DPO objective. We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models, and release a multi-reward DPO dataset to support reproducible research. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. These results highlight the importance of holistic, multi-reward alignment for practical conversational SDS.
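The abstract's idea of aggregating per-block log-probabilities within a single DPO objective can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the utterance-level log-probability is simply the sum of the per-block log-probabilities, uses the standard DPO loss, and picks `beta=0.1` arbitrarily. The function names (`sequence_logprob`, `blockwise_dpo_loss`) and the example numbers are hypothetical.

```python
import math
from typing import Sequence


def sequence_logprob(block_logprobs: Sequence[float]) -> float:
    """Aggregate per-block log-probabilities into one utterance-level value.

    Assumption: blocks are generated autoregressively, so the log-probability
    of the full response is the sum of the per-block log-probabilities.
    """
    return float(sum(block_logprobs))


def blockwise_dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss computed on block-aggregated log-probabilities."""
    # Log-ratio of policy to reference for the preferred response.
    chosen_margin = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    # Log-ratio of policy to reference for the dispreferred response.
    rejected_margin = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical per-block log-probabilities for one preference pair.
loss_neutral = blockwise_dpo_loss(
    policy_chosen=[-1.0, -2.0, -1.5],
    policy_rejected=[-2.0, -2.5],
    ref_chosen=[-1.0, -2.0, -1.5],
    ref_rejected=[-2.0, -2.5],
)
loss_aligned = blockwise_dpo_loss(
    policy_chosen=[-0.5, -1.0, -1.0],
    policy_rejected=[-3.0, -3.5],
    ref_chosen=[-1.0, -2.0, -1.5],
    ref_rejected=[-2.0, -2.5],
)
```

When the policy matches the reference on both responses, the loss equals log 2. It falls below that value once the policy assigns relatively more probability to the preferred response than the reference does, which is the case for `loss_aligned` here.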

Anthology ID: 2026.findings-acl.2040
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 41049–41066
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2040/
Cite (ACL): Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, and Shinji Watanabe. 2026. Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41049–41066, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback (Arora et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2040.pdf
Checklist: 2026.findings-acl.2040.checklist.pdf
