Abstract
Existing works in multimodal affective computing tasks, such as emotion recognition and personality recognition, generally adopt a two-phase pipeline by first extracting feature representations for each single modality with hand crafted algorithms, and then performing end-to-end learning with extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extracting algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain the performance with around half less computation in the feature extraction part of the model.- Anthology ID:
- 2021.naacl-main.417
- Volume:
- Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Venue:
- NAACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 5305–5316
- Language:
- URL:
- https://aclanthology.org/2021.naacl-main.417
- DOI:
- 10.18653/v1/2021.naacl-main.417
- Cite (ACL):
- Wenliang Dai, Samuel Cahyawijaya, Zihan Liu, and Pascale Fung. 2021. Multimodal End-to-End Sparse Model for Emotion Recognition. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5305–5316, Online. Association for Computational Linguistics.
- Cite (Informal):
- Multimodal End-to-End Sparse Model for Emotion Recognition (Dai et al., NAACL 2021)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2021.naacl-main.417.pdf
- Code
- wenliangdai/Multimodal-End2end-Sparse
- Data
- CMU-MOSEI, IEMOCAP