Making LVLMs Look Twice: Contrastive Decoding with Contrast Images

Avshalom Manevich, Reut Tsarfaty


Abstract
Large Vision-Language Models (LVLMs) are becoming increasingly popular for text-vision tasks requiring cross-modal reasoning, but often struggle with fine-grained visual discrimination. This limitation is evident in recent benchmarks like NaturalBench and D3, where closed models such as GPT-4o score only 39.6%, and open-source models perform below random chance (25%). We introduce Contrastive decoding with Contrast Images (CoCI), which adjusts LVLM outputs by contrasting them against outputs for similar images (Contrast Images, CIs). CoCI demonstrates strong performance across three distinct supervision regimes. First, when using naturally occurring CIs in benchmarks with curated image pairs, we achieve improvements of up to 98.9% on NaturalBench, 69.5% on D3, and 37.6% on MMVP. Second, for scenarios with modest training data (~5k samples), we show that a lightweight neural classifier can effectively select CIs from similar images at inference time, improving NaturalBench performance by up to 36.8%. Third, for scenarios with no training data, we develop a caption-matching technique that selects CIs by comparing LVLM-generated descriptions of candidate images. Notably, on VQAv2, our method improves VQA performance even in pointwise evaluation settings without explicit contrast images. Our approach demonstrates the potential for enhancing LVLMs at inference time through different CI selection strategies, each suited to a different data availability scenario.
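To make the core idea concrete, here is a minimal sketch of one next-token scoring step of contrastive decoding against a contrast image. It assumes the standard contrastive-decoding scoring rule (amplify the logits for the target image against the logits for the contrast image, as in visual contrastive decoding); the function name `coci_step`, the weight `alpha`, and the toy tensors are illustrative assumptions, not the paper's API or exact formulation.

```python
import torch

def coci_step(logits_target: torch.Tensor,
              logits_contrast: torch.Tensor,
              alpha: float = 1.0) -> torch.Tensor:
    """One next-token scoring step of contrastive decoding with a contrast image.

    Tokens the LVLM prefers given the target image, but not given the
    visually similar contrast image, are boosted, encouraging fine-grained
    visual discrimination. The (1 + alpha) / alpha weighting follows the
    common contrastive-decoding formulation; the paper's exact scoring
    rule (and any constraints on the candidate token set) may differ.
    """
    return (1 + alpha) * logits_target - alpha * logits_contrast

# Toy usage: next-token logits over a 5-token vocabulary (values are made up).
logits_target = torch.tensor([2.0, 0.5, 1.0, -1.0, 0.0])    # LVLM(target image, prompt)
logits_contrast = torch.tensor([1.8, 0.4, 1.4, -0.9, 0.1])  # LVLM(contrast image, prompt)
next_token = coci_step(logits_target, logits_contrast).argmax().item()
print(next_token)
```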
Anthology ID:
2025.magmar-1.6
Volume:
Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Reno Kriz, Kenton Murray
Venues:
MAGMaR | WS
Publisher:
Association for Computational Linguistics
Pages:
65–78
URL:
https://preview.aclanthology.org/landing_page/2025.magmar-1.6/
Cite (ACL):
Avshalom Manevich and Reut Tsarfaty. 2025. Making LVLMs Look Twice: Contrastive Decoding with Contrast Images. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), pages 65–78, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Making LVLMs Look Twice: Contrastive Decoding with Contrast Images (Manevich & Tsarfaty, MAGMaR 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.magmar-1.6.pdf