Abstract
Humans interpret the visual properties of objects based on context. For example, a banana appears brown when rotten and green when unripe. Previous studies focused on language models’ grasp of typical object properties. We introduce WINOVIZ, a text-only dataset of 1,380 examples that probe language models’ reasoning about diverse visual properties of objects under different contexts. The task demands both pragmatic reasoning and visual knowledge reasoning. We also present multi-hop data, a more challenging version that requires multi-step reasoning chains. Experimental findings include: a) GPT-4 excels overall but struggles with multi-hop data. b) Large models perform well in pragmatic reasoning but struggle with visual knowledge reasoning. c) Vision-language models outperform language-only models.
- Anthology ID:
- 2024.insights-1.14
- Volume:
- Proceedings of the Fifth Workshop on Insights from Negative Results in NLP
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
- Venues:
- insights | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 110–123
- URL:
- https://aclanthology.org/2024.insights-1.14
- Cite (ACL):
- Woojeong Jin, Tejas Srinivasan, Jesse Thomason, and Xiang Ren. 2024. WINOVIZ: Probing Visual Properties of Objects Under Different States. In Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, pages 110–123, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- WINOVIZ: Probing Visual Properties of Objects Under Different States (Jin et al., insights-WS 2024)
- PDF:
- https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.insights-1.14.pdf