Multimodal Weighted Fusion of Transformers for Movie Genre Classification
Isaac Rodríguez Bribiesca, Adrián Pastor López Monroy, Manuel Montes-y-Gómez
Abstract
The Multimodal Transformer has proven to be a competitive model for multimodal tasks involving textual, visual and audio signals. However, as more modalities are involved, its late fusion by concatenation starts to have a negative impact on the model's performance. Moreover, interpreting the model's predictions becomes difficult, as one would have to inspect the different attention activation matrices. To overcome these shortcomings, we propose to perform late fusion by adding a GMU module, which effectively allows the model to weight modalities at the instance level, improving its performance while providing a better interpretability mechanism. In the experiments, we compare our proposed model (MulT-GMU) against the original implementation (MulT-Concat) and a SOTA model on a movie genre classification dataset. Our approach, MulT-GMU, outperforms both MulT-Concat and the previous SOTA model.
- Anthology ID:
- 2021.maiworkshop-1.1
- Volume:
- Proceedings of the Third Workshop on Multimodal Artificial Intelligence
- Month:
- June
- Year:
- 2021
- Address:
- Mexico City, Mexico
- Venue:
- maiworkshop
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1–5
- URL:
- https://aclanthology.org/2021.maiworkshop-1.1
- DOI:
- 10.18653/v1/2021.maiworkshop-1.1
- Cite (ACL):
- Isaac Rodríguez Bribiesca, Adrián Pastor López Monroy, and Manuel Montes-y-Gómez. 2021. Multimodal Weighted Fusion of Transformers for Movie Genre Classification. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence, pages 1–5, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Multimodal Weighted Fusion of Transformers for Movie Genre Classification (Rodríguez Bribiesca et al., maiworkshop 2021)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2021.maiworkshop-1.1.pdf
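
The abstract contrasts late fusion by concatenation with gated, instance-level weighting of modalities. The following is a minimal sketch of that contrast, not the authors' implementation: it assumes a common gated-unit formulation (a tanh projection per modality, scaled by a sigmoid gate computed from all modalities jointly), and the names `concat_fusion`, `gmu_fusion`, `W_h`, and `W_z`, the feature sizes, and the random weights are all hypothetical choices made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Illustrative random weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def concat_fusion(features):
    """Late fusion by concatenation: every modality contributes equally."""
    return [v for x in features for v in x]


def gmu_fusion(features, W_h, W_z):
    """Gated fusion sketch: one gate vector per modality, per instance.

    features: list of per-modality feature vectors (sizes may differ)
    W_h[i]:   projects modality i into the shared hidden space
    W_z[i]:   maps the concatenation of all modalities to gate i
    """
    joint = concat_fusion(features)
    fused = [0.0] * len(W_h[0])  # hidden size = number of rows
    gates = []
    for x, Wh, Wz in zip(features, W_h, W_z):
        h = [math.tanh(a) for a in matvec(Wh, x)]   # modality representation
        z = [sigmoid(a) for a in matvec(Wz, joint)]  # gate values in (0, 1)
        fused = [f + zi * hi for f, zi, hi in zip(fused, z, h)]
        gates.append(z)
    return fused, gates


# Hypothetical text, audio and visual features for one movie.
rng = random.Random(0)
hidden = 4
features = [[0.2, -0.1, 0.7], [1.0, 0.3], [0.5, -0.4, 0.1, 0.9]]
total = sum(len(x) for x in features)
W_h = [random_matrix(hidden, len(x), rng) for x in features]
W_z = [random_matrix(hidden, total, rng) for _ in features]

fused, gates = gmu_fusion(features, W_h, W_z)
# Averaging each gate vector gives one rough weight per modality.
modality_weight = [sum(z) / len(z) for z in gates]
print(modality_weight)
```

Under these assumptions, the gate values are recomputed for every input, so the averaged weights in `modality_weight` give a per-instance indication of how much each modality contributed. This is the kind of interpretability signal the abstract refers to, and it can be read without inspecting attention matrices.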