Low Rank Fusion based Transformers for Multimodal Sequences

Saurav Sahay, Eda Okur, Shachi H Kumar, Lama Nachman


Abstract
Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa, expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of (CITATION) and (CITATION), we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for Multimodal Sentiment and Emotion Recognition on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.
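To illustrate the low-rank fusion idea the abstract refers to, the sketch below shows one common way to approximate multiplicative (tensor) fusion of several modality vectors with modality-specific rank-r factors. It is a minimal PyTorch illustration under assumed shapes and names (the class LowRankFusion, the example feature dimensions 300/74/35, and rank 4 are assumptions for illustration), not the authors' implementation.

import torch
import torch.nn as nn


class LowRankFusion(nn.Module):
    """Approximate the full multiplicative fusion of M modality vectors
    with rank-r modality-specific factors instead of one large fusion tensor."""

    def __init__(self, input_dims, output_dim, rank):
        super().__init__()
        self.rank = rank
        # One factor per modality: (rank, d_m + 1, output_dim).
        # The +1 appends a constant 1 so lower-order (unimodal/bimodal) terms are kept.
        self.factors = nn.ParameterList([
            nn.Parameter(torch.randn(rank, d + 1, output_dim) * 0.1)
            for d in input_dims
        ])
        self.fusion_weights = nn.Parameter(torch.randn(1, rank) * 0.1)
        self.fusion_bias = nn.Parameter(torch.zeros(1, output_dim))

    def forward(self, modality_inputs):
        # modality_inputs: list of tensors, each of shape (batch, d_m)
        batch = modality_inputs[0].size(0)
        fused = None
        for x, factor in zip(modality_inputs, self.factors):
            ones = torch.ones(batch, 1, device=x.device, dtype=x.dtype)
            x1 = torch.cat([x, ones], dim=-1)                # (batch, d_m + 1)
            proj = torch.matmul(x1, factor)                  # (rank, batch, output_dim)
            # Element-wise product across modalities approximates the tensor product.
            fused = proj if fused is None else fused * proj
        # Weighted sum over the rank dimension yields the fused representation.
        out = torch.matmul(self.fusion_weights, fused.permute(1, 0, 2)).squeeze(1)
        return out + self.fusion_bias


# Example usage: fuse text/audio/video vectors into a 64-d representation.
lmf = LowRankFusion(input_dims=[300, 74, 35], output_dim=64, rank=4)
z = [torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)]
print(lmf(z).shape)  # torch.Size([8, 64])

In the paper's architecture, a fused representation of this kind is combined with modality-specific transformer attention; the sketch above covers only the low-rank fusion step.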
Anthology ID:
2020.challengehml-1.4
Volume:
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)
Month:
July
Year:
2020
Address:
Seattle, USA
Venue:
Challenge-HML
Publisher:
Association for Computational Linguistics
Pages:
29–34
URL:
https://aclanthology.org/2020.challengehml-1.4
DOI:
10.18653/v1/2020.challengehml-1.4
Cite (ACL):
Saurav Sahay, Eda Okur, Shachi H Kumar, and Lama Nachman. 2020. Low Rank Fusion based Transformers for Multimodal Sequences. In Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), pages 29–34, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
Low Rank Fusion based Transformers for Multimodal Sequences (Sahay et al., Challenge-HML 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.challengehml-1.4.pdf
Video:
 http://slideslive.com/38931264
Data:
CMU-MOSEI, IEMOCAP