If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

Reza Esfandiarpoor, Cristina Menghini, Stephen Bach


Abstract
Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize textual features that are important for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate features that are important for the VLM. Then, we inspect the descriptions to identify features that contribute to VLM representations. Using EX2, we find that spurious descriptions play a major role in VLM representations despite providing no helpful information, e.g., "Click to enlarge photo of CONCEPT." More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat (e.g., North America) to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
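The sketch below is not the authors' released code; it is a minimal illustration of the kind of VLM preference signal the abstract refers to, assuming descriptions are scored by CLIP image-text similarity. The model name, image path, and candidate descriptions are illustrative placeholders.

```python
# Minimal sketch (not the EX2 pipeline): rank candidate concept descriptions
# by how strongly a VLM (CLIP) matches them to an image of the concept.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concept = "blue jay"
candidate_descriptions = [
    f"A photo of a {concept}, a bird with blue and white plumage.",  # visual attribute
    f"A photo of a {concept}, a bird found across North America.",   # non-visual attribute
    f"Click to enlarge photo of {concept}.",                         # spurious description
]
image = Image.open("blue_jay.jpg")  # placeholder image of the concept

inputs = processor(text=candidate_descriptions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts); a higher score means
# the VLM "prefers" that description for this concept's image.
scores = outputs.logits_per_image.squeeze(0)
for desc, score in sorted(zip(candidate_descriptions, scores.tolist()),
                          key=lambda pair: -pair[1]):
    print(f"{score:6.2f}  {desc}")
```

In the paper, such preference scores serve as a reward for aligning a language model with the VLM; the reward formulation above is an assumption for illustration only.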
Anthology ID:
2024.emnlp-main.547
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9797–9819
URL:
https://aclanthology.org/2024.emnlp-main.547
DOI:
10.18653/v1/2024.emnlp-main.547
Cite (ACL):
Reza Esfandiarpoor, Cristina Menghini, and Stephen Bach. 2024. If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9797–9819, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions (Esfandiarpoor et al., EMNLP 2024)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.547.pdf
Software:
2024.emnlp-main.547.software.zip