Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

Tianshu Yu; Haoyu Gao; Ting-En Lin; Min Yang; Yuchuan Wu; Wentao Ma; Chao Wang; Fei Huang; Yongbin Li

doi:10.18653/v1/2023.acl-long.438

Abstract

Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are tailored to one or two specific tasks and do not generalize to a wide range of speech-text tasks. In addition, existing speech-text pre-training methods do not exploit the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first speech-text dialog pre-training model. Concretely, to account for the temporal nature of the speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
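To make the temporal position prediction idea concrete, the following is a minimal sketch of how word-level start and end times could be turned into regression targets and scored. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the loss or the normalization, so the `WordSpan` type, the helper names, the division by clip duration, and the L1 (mean absolute error) loss are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class WordSpan:
    """A word with its start/end time in the waveform (hypothetical type)."""
    word: str
    start: float  # seconds
    end: float    # seconds


def temporal_targets(spans, duration):
    """Scale each word's start/end time by the clip duration.

    Assumption: targets are expressed as fractions of the clip length.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [(s.start / duration, s.end / duration) for s in spans]


def l1_alignment_loss(predicted, targets):
    """Mean absolute error over all start and end positions.

    Assumption: an L1 loss stands in for whatever objective the model uses.
    """
    if not targets or len(predicted) != len(targets):
        raise ValueError("predicted and targets must be non-empty and equal in length")
    total = 0.0
    for (pred_start, pred_end), (tgt_start, tgt_end) in zip(predicted, targets):
        total += abs(pred_start - tgt_start) + abs(pred_end - tgt_end)
    return total / (2 * len(targets))


# Example: two words in a 2-second clip.
spans = [WordSpan("hello", 0.0, 0.5), WordSpan("world", 0.5, 1.0)]
targets = temporal_targets(spans, duration=2.0)
loss = l1_alignment_loss([(0.1, 0.25), (0.25, 0.5)], targets)
```

In this sketch the word-level timestamps are taken as given; in practice they would typically come from a forced aligner run over the transcript and audio.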

Anthology ID: 2023.acl-long.438
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7900–7913
URL: https://aclanthology.org/2023.acl-long.438
DOI: 10.18653/v1/2023.acl-long.438
Cite (ACL): Tianshu Yu, Haoyu Gao, Ting-En Lin, Min Yang, Yuchuan Wu, Wentao Ma, Chao Wang, Fei Huang, and Yongbin Li. 2023. Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7900–7913, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment (Yu et al., ACL 2023)
PDF: https://preview.aclanthology.org/ingest-2024-clasp/2023.acl-long.438.pdf
Video: https://preview.aclanthology.org/ingest-2024-clasp/2023.acl-long.438.mp4