VCD: A Dataset for Visual Commonsense Discovery in Images

Xiangqing Shen, Fanfan Wang, Siwei Wu, Rui Xia


Abstract
Visual commonsense plays a vital role in understanding and reasoning about the visual world. While commonsense knowledge bases like ConceptNet provide structured collections of general facts, they lack visually grounded representations. Scene graph datasets like Visual Genome, though rich in object-level descriptions, primarily focus on directly observable information and lack systematic categorization of commonsense knowledge. We present the Visual Commonsense Dataset (VCD), a large-scale dataset that bridges this gap, containing over 100,000 images and 14 million object-commonsense pairs. VCD introduces a novel three-level taxonomy for visual commonsense, integrating both Seen (directly observable) and Unseen (inferrable) commonsense across Property, Action, and Space aspects. Each commonsense is represented as a triple whose head entity is grounded to an object bounding box in the image, enabling scene-dependent and object-specific visual commonsense representation. To demonstrate VCD's utility, we develop VCM, a generative model that combines a vision-language model with instruction tuning to discover diverse visual commonsense from images. Extensive evaluations demonstrate both the high quality of VCD and its value as a resource for advancing visually grounded commonsense understanding and reasoning. Our dataset and code will be released at https://github.com/NUSTM/VCD.
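To make the representation described in the abstract concrete, the sketch below shows how one grounded object and its commonsense triples might be organized. This is a minimal illustration, not the released VCD schema: all field names, relation labels, and values are assumptions for exposition only.

```python
# Illustrative sketch only: field names, relation labels, and values are
# assumptions, not the actual VCD schema. It shows how an object grounded
# to a bounding box might pair with Seen/Unseen commonsense triples
# across the Property, Action, and Space aspects.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CommonsenseTriple:
    head: str        # object name, grounded to the bounding box below
    relation: str    # e.g. "HasProperty", "CapableOf", "LocatedNear" (assumed labels)
    tail: str        # commonsense value
    aspect: str      # one of {"Property", "Action", "Space"}
    visibility: str  # "Seen" (directly observable) or "Unseen" (inferrable)


@dataclass
class VCDEntry:
    image_id: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) of the head object
    triples: List[CommonsenseTriple]


# Hypothetical example for an umbrella detected in an image.
entry = VCDEntry(
    image_id="vcd_000123",
    bbox=(48, 30, 120, 160),
    triples=[
        CommonsenseTriple("umbrella", "HasProperty", "open", "Property", "Seen"),
        CommonsenseTriple("umbrella", "CapableOf", "block rain", "Action", "Unseen"),
        CommonsenseTriple("umbrella", "LocatedNear", "person", "Space", "Seen"),
    ],
)
print(len(entry.triples))  # -> 3
```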
Anthology ID: 2025.findings-acl.290
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5562–5577
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.290/
Cite (ACL): Xiangqing Shen, Fanfan Wang, Siwei Wu, and Rui Xia. 2025. VCD: A Dataset for Visual Commonsense Discovery in Images. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5562–5577, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): VCD: A Dataset for Visual Commonsense Discovery in Images (Shen et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.290.pdf