CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Hang Li, Wenbiao Ding, Yu Kang, Tianqiao Liu, Zhongqin Wu, Zitao Liu


Abstract
Existing audio-language task-specific predictive approaches focus on building complicated late-fusion mechanisms. However, these models face challenges such as overfitting with limited labels and poor generalization. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially designed fusion mechanism for the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies demonstrating that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results. The code and pre-trained models are available at https://github.com/tal-ai/CTAL_EMNLP2021.
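
For readers unfamiliar with the two proxy tasks named in the abstract, below is a minimal, illustrative sketch (PyTorch-style Python) of how inputs for masked language modeling and masked acoustic modeling can be prepared. The function name, the [MASK] id, and the 15% masking rate are assumptions for illustration only, not the authors' implementation; see the linked repository for the official code.

import torch

MASK_TOKEN_ID = 103      # assumed BERT-style [MASK] id (illustrative only)
MASK_PROB = 0.15         # assumed masking rate for both modalities (illustrative only)

def mask_for_pretraining(token_ids, audio_frames):
    """token_ids: (seq_len,) LongTensor; audio_frames: (num_frames, feat_dim) FloatTensor."""
    # Masked language modeling: hide random tokens; the model predicts the originals.
    text_mask = torch.rand(token_ids.shape) < MASK_PROB
    mlm_labels = torch.where(text_mask, token_ids, torch.full_like(token_ids, -100))
    masked_tokens = torch.where(text_mask, torch.full_like(token_ids, MASK_TOKEN_ID), token_ids)

    # Masked acoustic modeling: zero out random frames; the model reconstructs their features.
    frame_mask = torch.rand(audio_frames.shape[0]) < MASK_PROB
    masked_frames = audio_frames.clone()
    masked_frames[frame_mask] = 0.0

    return masked_tokens, mlm_labels, masked_frames, frame_mask

# Toy usage: 32 subword tokens paired with 200 frames of 80-dim filterbank features.
tokens = torch.randint(0, 30000, (32,))
frames = torch.randn(200, 80)
masked_tokens, mlm_labels, masked_frames, frame_mask = mask_for_pretraining(tokens, frames)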
Anthology ID:
2021.emnlp-main.323
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3966–3977
URL:
https://aclanthology.org/2021.emnlp-main.323
DOI:
10.18653/v1/2021.emnlp-main.323
Cite (ACL):
Hang Li, Wenbiao Ding, Yu Kang, Tianqiao Liu, Zhongqin Wu, and Zitao Liu. 2021. CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3966–3977, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations (Li et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2021.emnlp-main.323.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-4/2021.emnlp-main.323.mp4
Code
ydkwim/ctal
Data
IEMOCAP, LibriSpeech