Analyzing the Sensitivity of Vision Language Models in Visual Question Answering
Monika Shah, Sudarshan Balaji, Somdeb Sarkhel, Sanorita Dey, Deepak Venugopal
Abstract
We can think of Visual Question Answering (VQA) as a (multimodal) conversation between a human and an AI system. We explore the sensitivity of Vision Language Models (VLMs) through the lens of Grice's cooperative principles of conversation. Even when Grice's maxims of conversation are flouted, humans typically have little difficulty understanding the conversation, although it requires more cognitive effort. We study whether VLMs can handle violations of Grice's maxims in a manner similar to humans. Specifically, we add modifiers to human-crafted questions and analyze how VLMs respond to these modifiers. We evaluate three state-of-the-art VLMs, namely GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash, on questions from the VQA v2.0 dataset. Our initial results indicate that the performance of VLMs consistently diminishes as modifiers are added, suggesting that our approach is a promising direction for understanding the limitations of VLMs.
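To make the evaluation protocol concrete, the sketch below illustrates the kind of perturb-and-compare loop the abstract describes: a modifier phrase is attached to a human-crafted VQA question, and the model's answers are collected before and after the modification. This is a minimal sketch, not the authors' released code; the function names (`add_modifier`, `query_vlm`, `sensitivity_check`) and the example modifier phrases are placeholders, and the paper's actual modifiers and prompting setup may differ.

```python
# Minimal sketch of a modifier-sensitivity check for VQA (illustrative only).

def add_modifier(question: str, modifier: str) -> str:
    """Attach a modifier phrase to a human-crafted question."""
    return f"{modifier} {question}"


def query_vlm(image_path: str, question: str) -> str:
    """Placeholder for a call to a VLM API (e.g., GPT-4o, Claude-3.5-Sonnet,
    or Gemini-1.5-Flash); wire this to the client of your choice."""
    raise NotImplementedError


def sensitivity_check(image_path: str, question: str, modifiers: list[str]) -> dict:
    """Collect the model's answer to the original question and to each
    modified variant, so the answers can be compared afterwards."""
    answers = {"original": query_vlm(image_path, question)}
    for m in modifiers:
        answers[m] = query_vlm(image_path, add_modifier(question, m))
    return answers


# Hypothetical usage with made-up modifier phrases:
# sensitivity_check("coco_000001.jpg", "What color is the bus?",
#                   ["By the way, it rained yesterday.", "Honestly speaking,"])
```

A drop in answer accuracy on the modified questions relative to the originals, aggregated over a dataset such as VQA v2.0, would indicate the kind of sensitivity the paper reports.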
- Anthology ID: 2025.gem-1.36
- Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
- Month: July
- Year: 2025
- Address: Vienna, Austria and virtual meeting
- Editors: Kaustubh Dhole, Miruna Clinciu
- Venues: GEM | WS
- Publisher: Association for Computational Linguistics
- Pages: 431–438
- URL: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.36/
- Cite (ACL): Monika Shah, Sudarshan Balaji, Somdeb Sarkhel, Sanorita Dey, and Deepak Venugopal. 2025. Analyzing the Sensitivity of Vision Language Models in Visual Question Answering. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 431–438, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
- Cite (Informal): Analyzing the Sensitivity of Vision Language Models in Visual Question Answering (Shah et al., GEM 2025)
- PDF: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.36.pdf