Manishit Kundu
2025
Looking Beyond the Pixels: Evaluating Visual Metaphor Understanding in VLMs
Manishit Kundu
|
Sumit Shekhar
|
Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2025
Visual metaphors are a complex vision–language phenomenon that requires both perceptual and conceptual reasoning to understand. They provide a valuable test of a model’s ability to interpret visual input and reason about it with creativity and coherence. We introduce ImageMet, a visual metaphor dataset, featuring 2177 synthetic and 350 human-annotated images. We benchmark several SOTA VLMs on two tasks: Visual Metaphor Captioning (VMC) and Visual Metaphor VQA (VM-VQA). We establish strong baselines by fine-tuning on ImageMet, which yields substantial performance gains in VMC (+4.67% SBERT-Similarity, +4.84% task-specific metric) and VM-VQA (+9.3% Accuracy on average). Additionally, we introduce a task-specific CoT prompting strategy that outperforms standard few-shot baselines (+1.99% in VMC, +5.21% in VM-VQA). We observe that despite strong performance on the VMC task, VLMs still significantly lag behind humans in understanding visual metaphors, indicating that their success often relies on learned associations rather than genuine analytical reasoning. We note that this gap is often obscured in metaphor captioning tasks where the automatic metrics correlate only moderately at best with human judgment (Pearson r < 0.6), highlighting the need for careful, holistic evaluation of the visual metaphor understanding of the models.