Looking Beyond the Pixels: Evaluating Visual Metaphor Understanding in VLMs

Manishit Kundu, Sumit Shekhar, Pushpak Bhattacharyya


Abstract
Visual metaphors are a complex vision–language phenomenon whose interpretation requires both perceptual and conceptual reasoning. They provide a valuable test of a model’s ability to interpret visual input and reason about it with creativity and coherence. We introduce ImageMet, a visual metaphor dataset comprising 2,177 synthetic and 350 human-annotated images. We benchmark several state-of-the-art (SOTA) VLMs on two tasks: Visual Metaphor Captioning (VMC) and Visual Metaphor VQA (VM-VQA). We establish strong baselines by fine-tuning on ImageMet, which yields substantial performance gains on VMC (+4.67% SBERT similarity, +4.84% on a task-specific metric) and VM-VQA (+9.3% accuracy on average). Additionally, we introduce a task-specific chain-of-thought (CoT) prompting strategy that outperforms standard few-shot baselines (+1.99% on VMC, +5.21% on VM-VQA). We observe that despite strong performance on the VMC task, VLMs still lag significantly behind humans in understanding visual metaphors, indicating that their success often relies on learned associations rather than genuine analytical reasoning. We note that this gap is often obscured in metaphor captioning tasks, where automatic metrics correlate at best moderately with human judgment (Pearson r < 0.6), highlighting the need for careful, holistic evaluation of models’ visual metaphor understanding.
Anthology ID:
2025.findings-emnlp.1257
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
23137–23158
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1257/
DOI:
10.18653/v1/2025.findings-emnlp.1257
Cite (ACL):
Manishit Kundu, Sumit Shekhar, and Pushpak Bhattacharyya. 2025. Looking Beyond the Pixels: Evaluating Visual Metaphor Understanding in VLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23137–23158, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Looking Beyond the Pixels: Evaluating Visual Metaphor Understanding in VLMs (Kundu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1257.pdf
Checklist:
2025.findings-emnlp.1257.checklist.pdf