Probing Contextual Language Models for Common Ground with Visual Representations
Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi
Abstract
The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
- Anthology ID:
- 2021.naacl-main.422
- Volume:
- Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Editors:
- Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5367–5377
- URL:
- https://aclanthology.org/2021.naacl-main.422
- DOI:
- 10.18653/v1/2021.naacl-main.422
- Cite (ACL):
- Gabriel Ilharco, Rowan Zellers, Ali Farhadi, and Hannaneh Hajishirzi. 2021. Probing Contextual Language Models for Common Ground with Visual Representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5367–5377, Online. Association for Computational Linguistics.
- Cite (Informal):
- Probing Contextual Language Models for Common Ground with Visual Representations (Ilharco et al., NAACL 2021)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/2021.naacl-main.422.pdf
- Data:
- MS COCO, Visual Genome, Visual Question Answering
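The retrieval evaluation described in the abstract (checking whether a text-only representation can pick out its matching visual representation among non-matching ones) can be illustrated with a minimal sketch. This is not the authors' probing model: the linear map `W`, the cosine-similarity scoring, the recall@k metric, and all function names below are assumptions chosen only for illustration, since the abstract does not specify them.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project(W: Sequence[Vector], x: Vector) -> List[float]:
    """Hypothetical linear probe: map a text representation into image space.

    W holds one row per output dimension (an assumption, not from the paper).
    """
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def recall_at_k(
    text_reps: Sequence[Vector],
    image_reps: Sequence[Vector],
    W: Sequence[Vector],
    k: int = 1,
) -> float:
    """Fraction of text representations whose matching image patch
    (same index in image_reps) is ranked among the top k candidates."""
    if not text_reps:
        return 0.0
    hits = 0
    for i, text in enumerate(text_reps):
        query = project(W, text)
        scores = [cosine(query, image) for image in image_reps]
        # Rank candidate image patches from most to least similar.
        ranked = sorted(range(len(image_reps)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_reps)


# Toy data: two noun representations and two image-patch representations.
identity = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.9, 0.1], [0.2, 0.8]]
print(recall_at_k(texts, patches, identity, k=1))
```

In this toy setup each text vector is closest to the patch at the same index, so every query retrieves its match at rank 1. In a real probe, `W` would be learned while the language model's representations stay frozen, so that retrieval accuracy reflects what the text-only representations already encode.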