AbsVis – Benchmarking How Humans and Vision-Language Models “See” Abstract Concepts in Images

Tarun Tater, Diego Frassinelli, Sabine Schulte im Walde


Abstract
Abstract concepts like mercy and peace often lack clear visual grounding and thus challenge both humans and models to provide suitable image representations. To address this challenge, we introduce AbsVis – a dataset of 675 images annotated with 14,175 concept–explanation attributions from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each attributed concept is accompanied by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences, and we use these preferences to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignment with preferred concept–explanation pairs.
Anthology ID:
2025.emnlp-main.417
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8271–8292
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.417/
Cite (ACL):
Tarun Tater, Diego Frassinelli, and Sabine Schulte im Walde. 2025. AbsVis – Benchmarking How Humans and Vision-Language Models “See” Abstract Concepts in Images. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8271–8292, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
AbsVis – Benchmarking How Humans and Vision-Language Models “See” Abstract Concepts in Images (Tater et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.417.pdf
Checklist:
2025.emnlp-main.417.checklist.pdf