Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Sohee Kim, Soohyun Ryu, Joonhyung Park, Eunho Yang


Abstract
Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our findings reveal that they often mistakenly perceive text inputs lacking visual evidence as part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine whether textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its predictions, we propose a method that refines model outputs by reinterpreting question prompts or replacing detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models' tendency to falsely presume the visual presence of text inputs and generalizes across various LVLMs.
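As a rough illustration of the detection idea in the abstract, the Python sketch below shows how a classifier over FFN activations might flag a visually absent token. It is not the authors' released code: the neuron indices, threshold, and mean-activation scoring rule are hypothetical stand-ins for the quantities the paper identifies from data.

```python
# Minimal sketch (assumed, not the paper's implementation) of flagging a
# visually absent token from Feed-Forward Network (FFN) activations.
import numpy as np

# Hypothetical: indices of Visual Absence-aware (VA) neurons in one FFN layer,
# identified offline from activation statistics on grounded vs. absent tokens.
VA_NEURON_IDS = np.array([17, 342, 1021, 2780])
THRESHOLD = 0.5  # hypothetical decision boundary

def is_visually_absent(ffn_activations: np.ndarray) -> bool:
    """Classify one token as visually absent from its FFN activations.

    ffn_activations: 1-D array of FFN hidden activations for the token.
    Returns True if the mean activation over the VA neurons exceeds the
    threshold, i.e. the token is predicted to lack visual evidence.
    """
    score = ffn_activations[VA_NEURON_IDS].mean()
    return score > THRESHOLD

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    token_acts = rng.normal(size=4096)     # stand-in for real activations
    token_acts[VA_NEURON_IDS] = 2.0        # simulate a strong VA signal
    print(is_visually_absent(token_acts))  # -> True
```

In the paper's pipeline, a positive prediction from such a module would then trigger the output refinement step, e.g. reinterpreting the question prompt or replacing the flagged token during generation.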
Anthology ID:
2025.emnlp-main.1092
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
21546–21568
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1092/
Cite (ACL):
Sohee Kim, Soohyun Ryu, Joonhyung Park, and Eunho Yang. 2025. Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21546–21568, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens (Kim et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1092.pdf
Checklist:
2025.emnlp-main.1092.checklist.pdf