Attending Self-Attention: A Case Study of Visually Grounded Supervision in Vision-and-Language Transformers

Jules Samaran; Noa Garcia; Mayu Otani; Chenhui Chu; Yuta Nakashima

doi:10.18653/v1/2021.acl-srw.8

Abstract

The impressive performance of pre-trained visually grounded language models has motivated a growing body of research into what these models learn during pre-training. Because many of these models are based on Transformers, several studies have examined the attention mechanisms they use to associate phrases with their visual grounding in the image. In this work, we investigate how directly supervising attention to learn visual grounding affects the behavior of such models. We compare three methods of attention supervision and their impact on the performance of a state-of-the-art visually grounded language model on two popular vision-and-language tasks.

Anthology ID: 2021.acl-srw.8
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month: August
Year: 2021
Address: Online
Editors: Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 81–86
URL: https://aclanthology.org/2021.acl-srw.8
DOI: 10.18653/v1/2021.acl-srw.8
Cite (ACL): Jules Samaran, Noa Garcia, Mayu Otani, Chenhui Chu, and Yuta Nakashima. 2021. Attending Self-Attention: A Case Study of Visually Grounded Supervision in Vision-and-Language Transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 81–86, Online. Association for Computational Linguistics.
Cite (Informal): Attending Self-Attention: A Case Study of Visually Grounded Supervision in Vision-and-Language Transformers (Samaran et al., ACL-IJCNLP 2021)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2021.acl-srw.8.pdf
Optional supplementary material: 2021.acl-srw.8.OptionalSupplementaryMaterial.pdf
Video: https://preview.aclanthology.org/nschneid-patch-1/2021.acl-srw.8.mp4
Data: Conceptual Captions, RefCOCO
