MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

Aisha Urooj; Amir Mazaheri; Niels Da vitoria lobo; Mubarak Shah

doi:10.18653/v1/2020.findings-emnlp.417

Abstract

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings), an approach to Visual Question Answering (VQA) that processes multiple input modalities both individually and jointly. Our approach encodes each modality of the multimodal data (video and text) individually with BERT and then fuses the encodings with a novel transformer-based fusion method. Each source of input is assigned to its own BERT instance; the instances share a similar architecture but have different weights. This achieves state-of-the-art results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality, based on a human annotator’s judgment. This set of questions helps us study the model’s behavior and the challenges TVQA poses that prevent superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
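The two-stage idea in the abstract (one encoder per modality, then a transformer-style fusion step) can be illustrated with a toy sketch. This is not the authors' implementation: the per-modality "encoders" below are random linear maps standing in for separate BERT instances, the fusion step is a single scaled dot-product attention with a hypothetical fusion token as the query, and all names and dimensions (`make_encoder`, `fuse`, `DIM_IN`, `DIM_OUT`) are assumptions chosen for illustration.

```python
import math
import random


def make_encoder(dim_in, dim_out, seed):
    """Stand-in for one BERT instance: a linear map with its own weights.

    Every encoder has the same shape (similar architecture), but a
    different seed gives each one different weights.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim_in)]
               for _ in range(dim_out)]

    def encode(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return encode


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(fusion_token, modality_vectors):
    """One attention step: the fusion token attends over itself and
    every modality vector, and the weighted sum is the fused output."""
    sequence = [fusion_token] + modality_vectors
    scale = math.sqrt(len(fusion_token))
    scores = [sum(q * k for q, k in zip(fusion_token, vec)) / scale
              for vec in sequence]
    attention = softmax(scores)
    fused = [sum(a * vec[i] for a, vec in zip(attention, sequence))
             for i in range(len(fusion_token))]
    return fused, attention


DIM_IN, DIM_OUT = 6, 4  # toy sizes, not taken from the paper

# One encoder per modality (hypothetical names), each with its own weights.
encoders = {name: make_encoder(DIM_IN, DIM_OUT, seed)
            for seed, name in enumerate(["text", "video"])}

# Random toy features stand in for real text and video inputs.
rng = random.Random(42)
inputs = {name: [rng.uniform(-1.0, 1.0) for _ in range(DIM_IN)]
          for name in encoders}

encoded = [encoders[name](inputs[name]) for name in encoders]
fusion_token = [0.1] * DIM_OUT  # a learned vector in a real model
fused, attention = fuse(fusion_token, encoded)
```

In this sketch `fused` is a single vector that mixes the per-modality encodings, and `attention` shows how much weight each input received. A real model would learn the fusion token and projection weights and would typically stack multi-head attention with feed-forward layers.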

Anthology ID: 2020.findings-emnlp.417
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Editors: Trevor Cohn, Yulan He, Yang Liu
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4648–4660
URL: https://aclanthology.org/2020.findings-emnlp.417
DOI: 10.18653/v1/2020.findings-emnlp.417
Cite (ACL): Aisha Urooj, Amir Mazaheri, Niels Da vitoria lobo, and Mubarak Shah. 2020. MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4648–4660, Online. Association for Computational Linguistics.
Cite (Informal): MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering (Urooj et al., Findings 2020)
PDF: https://preview.aclanthology.org/naacl24-info/2020.findings-emnlp.417.pdf
Code: aurooj/MMFT-BERT
Data: TVQA, TVQA+, Visual Question Answering
