Manishit Kundu

2026

CaRVE: Critiquing and Refining Visual Elaborations for Figurative Language Illustrations
Manishit Kundu | Tejomay Kishor Padole | Sumit Shekhar | Biplab Banerjee | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: ACL 2026

Illustrating figurative language remains challenging due to its non-literal semantics, and existing text-to-image frameworks rely heavily on proprietary models or human supervision to achieve adequate alignment. We introduce CaRVE, a lightweight and fully open-source critique-driven framework that employs VLM feedback to refine visual elaborations for figurative image generation. CaRVE bridges the semantic alignment gap even in sub-4B models by correcting visual and conceptual misalignments, reducing over-literalization, and improving robustness to complex figurative expressions. Using only open-source models, CaRVE achieves a 6.49% improvement over prior baselines on intrinsic automatic evaluations and a +0.37 average rank gain in human preference. We further release MetaCaRVE, an enhanced figurative image dataset constructed by refining HAIVMet using CaRVE.

2025

Looking Beyond the Pixels: Evaluating Visual Metaphor Understanding in VLMs
Manishit Kundu | Sumit Shekhar | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2025

Visual metaphors are a complex vision–language phenomenon that requires both perceptual and conceptual reasoning to understand. They provide a valuable test of a model’s ability to interpret visual input and reason about it with creativity and coherence. We introduce ImageMet, a visual metaphor dataset, featuring 2177 synthetic and 350 human-annotated images. We benchmark several SOTA VLMs on two tasks: Visual Metaphor Captioning (VMC) and Visual Metaphor VQA (VM-VQA). We establish strong baselines by fine-tuning on ImageMet, which yields substantial performance gains in VMC (+4.67% SBERT-Similarity, +4.84% task-specific metric) and VM-VQA (+9.3% Accuracy on average). Additionally, we introduce a task-specific CoT prompting strategy that outperforms standard few-shot baselines (+1.99% in VMC, +5.21% in VM-VQA). We observe that despite strong performance on the VMC task, VLMs still significantly lag behind humans in understanding visual metaphors, indicating that their success often relies on learned associations rather than genuine analytical reasoning. We note that this gap is often obscured in metaphor captioning tasks where the automatic metrics correlate only moderately at best with human judgment (Pearson r < 0.6), highlighting the need for careful, holistic evaluation of the visual metaphor understanding of the models.

