Abstract
We explore how a multi-modal transformer trained to generate longer image descriptions learns syntactic and semantic representations of entities and relations grounded in objects, at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects as well as high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available at https://github.com/GU-CLASP/attention-as-grounding.
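The abstract contrasts two attention mechanisms: masked (causal) self-attention over the text and cross-modal attention from text to visual features. The sketch below is a minimal illustration of where those two attention maps come from, not the authors' implementation; all dimensions, module choices, and variable names (`d_model`, `n_vis`, etc.) are assumptions for the example.

```python
# Minimal sketch (not the paper's code) contrasting the two attention
# types analysed in the abstract: masked self-attention for text
# generation and cross-modal attention for information fusion.
# Dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
txt_len, n_vis = 12, 36  # e.g. 36 region features from an object detector

# Text-to-text attention: a causal mask keeps generation autoregressive,
# so each word attends only to earlier words.
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
causal_mask = torch.triu(
    torch.ones(txt_len, txt_len, dtype=torch.bool), diagonal=1
)

# Cross-modal attention: text states are queries, visual features are
# keys/values, so each word can attend to image regions (grounding).
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text = torch.randn(1, txt_len, d_model)    # dummy word representations
vision = torch.randn(1, n_vis, d_model)    # dummy projected region features

_, self_weights = self_attn(text, text, text, attn_mask=causal_mask)
_, cross_weights = cross_attn(text, vision, vision)

# self_weights:  (1, txt_len, txt_len) -- word-to-word distributions
# cross_weights: (1, txt_len, n_vis)   -- word-to-region distributions;
# rows where a noun phrase concentrates its mass on the matching region
# are the kind of grounding pattern the paper inspects.
print(self_weights.shape, cross_weights.shape)
```

Inspecting the two weight tensors separately is what lets the analysis attribute syntactic patterns to text-to-text attention and grounding patterns to cross-attention.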
- Anthology ID: 2022.findings-acl.320
- Volume: Findings of the Association for Computational Linguistics: ACL 2022
- Month: May
- Year: 2022
- Address: Dublin, Ireland
- Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 4062–4073
- URL: https://aclanthology.org/2022.findings-acl.320
- DOI: 10.18653/v1/2022.findings-acl.320
- Cite (ACL): Nikolai Ilinykh and Simon Dobnik. 2022. Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4062–4073, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal): Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer (Ilinykh & Dobnik, Findings 2022)
- PDF: https://preview.aclanthology.org/bionlp-24-ingestion/2022.findings-acl.320.pdf
- Code: gu-clasp/attention-as-grounding
- Data: Image Description Sequences