Probing Contextual Language Models for Common Ground with Visual Representations

Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi


Abstract
The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
Anthology ID:
2021.naacl-main.422
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5367–5377
URL:
https://aclanthology.org/2021.naacl-main.422
DOI:
10.18653/v1/2021.naacl-main.422
Cite (ACL):
Gabriel Ilharco, Rowan Zellers, Ali Farhadi, and Hannaneh Hajishirzi. 2021. Probing Contextual Language Models for Common Ground with Visual Representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5367–5377, Online. Association for Computational Linguistics.
Cite (Informal):
Probing Contextual Language Models for Common Ground with Visual Representations (Ilharco et al., NAACL 2021)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2021.naacl-main.422.pdf
Video:
https://preview.aclanthology.org/emnlp22-frontmatter/2021.naacl-main.422.mp4
Data:
MS COCO, Visual Genome, Visual Question Answering