Ayaan Siddiqui

2026

Semantic Contrastive Adaptation for Multimodal Figurative Language Understanding
Ayaan Siddiqui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Understanding idiomatic and figurative language in images remains a fundamental challenge for vision–language models, as it requires reasoning beyond literal image–text alignment. Although large pretrained models such as CLIP and BLIP-2 perform well on literal recognition, they consistently fail on multimodal figurative benchmarks, often favoring visually salient but semantically literal interpretations. We show that this failure arises from a systematic literal alignment bias rather than limited model capacity. Motivated by this observation, we reformulate multimodal figurative understanding as a contrastive semantic deviation problem, where figurative images must be distinguished from visually plausible literal alternatives. We introduce a parameter-efficient adaptation of CLIP using Low-Rank Adaptation (LoRA) with hard literal negative mining, achieving targeted reshaping of multimodal representations without full fine-tuning. Experiments on the IRFL benchmark across idioms, metaphors, and similes demonstrate substantial improvements over zero-shot CLIP, BLIP-2, ensemble-based, and knowledge-augmented baselines. Finally, we introduce FIGMENT, a multilingual figurative grounding evaluation spanning five idiom-rich languages, and show that the adapted model generalizes across languages despite being trained exclusively on English supervision.
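The abstract does not give the training objective, so the following is only a minimal sketch of what a contrastive loss with hard literal negatives could look like. It assumes an InfoNCE-style objective over precomputed embeddings. The function names, the temperature of 0.07, and the top-k mining rule are illustrative assumptions and are not taken from the paper. The LoRA adaptation itself is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(text, literal_candidates, k=2):
    """Hypothetical mining rule: keep the k literal candidates whose
    embeddings are most similar to the phrase embedding."""
    ranked = sorted(literal_candidates, key=lambda c: cosine(text, c), reverse=True)
    return ranked[:k]


def hard_negative_contrastive_loss(text, figurative, literal_negatives, temperature=0.07):
    """InfoNCE-style loss (an assumption, not the paper's stated objective).

    The figurative image is the positive; the literal images are negatives.
    Returns -log softmax of the positive logit, computed stably.
    """
    logits = [cosine(text, figurative) / temperature]
    logits += [cosine(text, neg) / temperature for neg in literal_negatives]
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    phrase = [1.0, 0.0]                      # toy phrase embedding
    figurative_image = [1.0, 0.1]            # toy positive
    literal_pool = [[0.0, 1.0], [0.5, 0.5], [0.2, 0.9]]
    negatives = mine_hard_negatives(phrase, literal_pool, k=2)
    print(hard_negative_contrastive_loss(phrase, figurative_image, negatives))
```

Under this sketch the loss is small when the phrase embedding sits closer to the figurative image than to any mined literal image, and grows when a literal image wins, which is the behaviour the abstract attributes to the literal alignment bias.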

