@inproceedings{cho-maeng-2025-vision,
title = "Can Vision-Language Models Infer Speaker{'}s Ignorance? The Role of Visual and Linguistic Cues",
author = "Cho, Ye-eun and
Maeng, Yunho",
editor = "Noidea, Noidea",
booktitle = "Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.uncertainlp-main.25/",
pages = "298--308",
ISBN = "979-8-89176-349-4",
abstract = "This study investigates whether vision-language models (VLMs) can perform pragmatic inference, focusing on ignorance implicatures, utterances that imply the speaker{'}s lack of precise knowledge. To test this, we systematically manipulated contextual cues: the visually depicted situation (visual cue) and QUD-based linguistic prompts (linguistic cue). When only visual cues were provided, three state-of-the-art VLMs (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 sonnet) produced interpretations largely based on the lexical meaning of the modified numerals. When linguistic cues were added to enhance contextual informativeness, Claude exhibited more human-like inference by integrating both types of contextual cues. In contrast, GPT and Gemini favored precise, literal interpretations. Although the influence of contextual cues increased, they treated each contextual cue independently and aligned them with semantic features rather than engaging in context-driven reasoning. These findings suggest that although the models differ in how they handle contextual cues, Claude{'}s ability to combine multiple cues may signal emerging pragmatic competence in multimodal models."
}

Markdown (Informal)

[Can Vision-Language Models Infer Speaker's Ignorance? The Role of Visual and Linguistic Cues](https://preview.aclanthology.org/ingest-emnlp/2025.uncertainlp-main.25/) (Cho & Maeng, UncertaiNLP 2025)