Let’s Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Se Park, Chae Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, Yong Ro


Abstract
In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking an initial step toward an avatar chatbot system that does not rely on intermediate text. To this end, we introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus, containing 340 hours of approximately 9,000 dialogues recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script, with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it to the audio-visual spoken dialogue domain through speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.
Anthology ID:
2024.acl-long.860
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
16334–16348
URL:
https://aclanthology.org/2024.acl-long.860
DOI:
10.18653/v1/2024.acl-long.860
Cite (ACL):
Se Park, Chae Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Ro. 2024. Let’s Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16334–16348, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Let’s Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation (Park et al., ACL 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.860.pdf
Video:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.860.mp4