Abstract
We explore how a multi-modal transformer trained to generate longer image descriptions learns syntactic and semantic representations of entities and relations grounded in objects, at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects as well as high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available at https://github.com/GU-CLASP/attention-as-grounding.
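The abstract contrasts two attention mechanisms: masked (causal) self-attention over the text and cross-modal attention from text to visual features. The sketch below is a minimal illustration of where those two attention maps come from, not the authors' implementation; all dimensions, module choices, and variable names (`d_model`, `n_vis`, etc.) are assumptions for the example.

```python
# Minimal sketch (not the paper's code) contrasting the two attention
# types analysed in the abstract: masked self-attention for text
# generation and cross-modal attention for information fusion.
# Dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
txt_len, n_vis = 12, 36  # e.g. 36 region features from an object detector

# Text-to-text attention: a causal mask keeps generation autoregressive,
# so each word attends only to earlier words.
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
causal_mask = torch.triu(
    torch.ones(txt_len, txt_len, dtype=torch.bool), diagonal=1
)

# Cross-modal attention: text states are queries, visual features are
# keys/values, so each word can attend to image regions (grounding).
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text = torch.randn(1, txt_len, d_model)    # dummy word representations
vision = torch.randn(1, n_vis, d_model)    # dummy projected region features

_, self_weights = self_attn(text, text, text, attn_mask=causal_mask)
_, cross_weights = cross_attn(text, vision, vision)

# self_weights:  (1, txt_len, txt_len) -- word-to-word distributions
# cross_weights: (1, txt_len, n_vis)   -- word-to-region distributions;
# rows where a noun phrase concentrates its mass on the matching region
# are the kind of grounding pattern the paper inspects.
print(self_weights.shape, cross_weights.shape)
```

Inspecting the two weight tensors separately is what lets the analysis attribute syntactic patterns to text-to-text attention and grounding patterns to cross-attention.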
- Anthology ID: 2022.findings-acl.320
- Volume: Findings of the Association for Computational Linguistics: ACL 2022
- Month: May
- Year: 2022
- Address: Dublin, Ireland
- Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 4062–4073
- URL: https://aclanthology.org/2022.findings-acl.320
- DOI: 10.18653/v1/2022.findings-acl.320
- Cite (ACL): Nikolai Ilinykh and Simon Dobnik. 2022. Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4062–4073, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal): Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer (Ilinykh & Dobnik, Findings 2022)
- PDF: https://preview.aclanthology.org/bionlp-24-ingestion/2022.findings-acl.320.pdf
- Code: gu-clasp/attention-as-grounding
- Data: Image Description Sequences