Semantic Contrastive Adaptation for Multimodal Figurative Language Understanding

Ayaan Siddiqui


Abstract
Understanding idiomatic and figurative language in images remains a fundamental challenge for vision–language models, as it requires reasoning beyond literal image–text alignment. Although large pretrained models such as CLIP and BLIP-2 perform well on literal recognition, they consistently fail on multimodal figurative benchmarks, often favoring visually salient but semantically literal interpretations. We show that this failure arises from a systematic literal alignment bias rather than limited model capacity. Motivated by this observation, we reformulate multimodal figurative understanding as a contrastive semantic deviation problem, where figurative images must be distinguished from visually plausible literal alternatives. We introduce a parameter-efficient adaptation of CLIP using Low-Rank Adaptation (LoRA) with hard literal negative mining, achieving targeted reshaping of multimodal representations without full fine-tuning. Experiments on the IRFL benchmark across idioms, metaphors, and similes demonstrate substantial improvements over zero-shot CLIP, BLIP-2, ensemble-based, and knowledge-augmented baselines. Finally, we introduce FIGMENT, a multilingual figurative grounding evaluation spanning five idiom-rich languages, and show that the adapted model generalizes across languages despite being trained exclusively on English supervision.
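The contrastive formulation with hard literal negatives can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the loss, so an InfoNCE-style objective is assumed here, and the function names (`mine_hard_negatives`, `contrastive_loss`), the temperature of 0.07, and the toy two-dimensional embeddings are all hypothetical. In practice the embeddings would come from the LoRA-adapted CLIP encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negatives(text_emb, literal_image_embs, k):
    """Pick the k literal images most similar to the phrase embedding.

    Assumption: "hard" means highest similarity to the text, i.e. the
    literal depictions the model is most likely to confuse with the
    figurative target.
    """
    ranked = sorted(literal_image_embs,
                    key=lambda img: cosine(text_emb, img),
                    reverse=True)
    return ranked[:k]

def contrastive_loss(text_emb, figurative_emb, negative_embs, temperature=0.07):
    """InfoNCE-style loss (assumed form): the figurative image is the
    positive, the mined literal images are the negatives."""
    logits = [cosine(text_emb, figurative_emb) / temperature]
    logits += [cosine(text_emb, n) / temperature for n in negative_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy example with made-up 2-D embeddings.
phrase = [1.0, 0.0]
figurative_image = [1.0, 0.0]
literal_candidates = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

hard = mine_hard_negatives(phrase, literal_candidates, k=1)
loss_hard = contrastive_loss(phrase, figurative_image, hard)
loss_easy = contrastive_loss(phrase, figurative_image, [[0.0, 1.0]])
```

In this toy setup the mined negative is the literal candidate closest to the phrase, and it yields a larger loss than an unrelated negative, which is the signal that would push the adapted representation away from visually plausible literal readings.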
Anthology ID:
2026.acl-srw.12
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
142–151
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.12/
Cite (ACL):
Ayaan Siddiqui. 2026. Semantic Contrastive Adaptation for Multimodal Figurative Language Understanding. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 142–151, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Semantic Contrastive Adaptation for Multimodal Figurative Language Understanding (Siddiqui, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.12.pdf