Understanding Attention for Vision-and-Language Tasks

Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, Josiah Poon


Abstract
The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks in order to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how well they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism is more (or less) interpretable, and which may impact model performance, on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL
Anthology ID:
2022.coling-1.304
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3438–3453
URL:
https://aclanthology.org/2022.coling-1.304
Cite (ACL):
Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, and Josiah Poon. 2022. Understanding Attention for Vision-and-Language Tasks. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3438–3453, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Understanding Attention for Vision-and-Language Tasks (Cao et al., COLING 2022)
PDF:
https://preview.aclanthology.org/ingest-acl-2023-videos/2022.coling-1.304.pdf
Code:
adlnlp/attention_vl
Data:
TextVQA