Defining and Quantifying Visual Hallucinations in Vision-Language Models

Vipula Rawte, Aryan Mishra, Amit Sheth, Amitava Das


Abstract
The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI. In recent times, considerable research has focused on detecting and mitigating hallucination in Large Language Models (LLMs); however, hallucination is also prevalent in Vision-Language Models (VLMs). In this paper, we offer a fine-grained discourse on profiling VLM hallucination in the image captioning task. We delineate eight fine-grained orientations of visual hallucination: i) Contextual Guessing, ii) Identity Incongruity, iii) Geographical Erratum, iv) Visual Illusion, v) Gender Anomaly, vi) VLM as Classifier, vii) Wrong Reading, and viii) Numeric Discrepancy. We curate Visual HallucInation eLiciTation (VHILT), a publicly available dataset comprising 2,000 samples generated by eight VLMs on the image captioning task, along with human annotations for the aforementioned categories. To quantify hallucination and to provide a comparative framework for evaluating and ranking VLMs by their vulnerability to producing hallucinations, we propose the Visual Hallucination Vulnerability Index (VHVI). In summary, we introduce the VHILT dataset for image-to-text hallucinations and propose the VHVI metric to quantify hallucinations in VLMs, targeting specific visual hallucination types. A subset sample is available at: https://huggingface.co/datasets/vr25/vhil. The full dataset will be publicly released upon acceptance.
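
The abstract points to a publicly released subset on the Hugging Face Hub. Below is a minimal sketch of loading and inspecting that subset with the `datasets` library; the repository id `vr25/vhil` comes from the URL above, but the split names and column layout are not documented on this page and are therefore assumptions to be verified against the actual dataset card.

```python
# Minimal sketch: load the publicly released VHILT subset from the Hugging Face Hub.
# Assumptions: the repo id "vr25/vhil" (from the URL above) exposes a default
# configuration; split and column names are not specified here, so we only
# inspect whatever the hosted subset actually provides.
from datasets import load_dataset

dataset = load_dataset("vr25/vhil")  # returns a DatasetDict keyed by split name

# Report the size and columns of each available split.
for split_name, split in dataset.items():
    print(split_name, split.num_rows, split.column_names)

# Peek at one record to see which captions and hallucination labels are present.
first_split = next(iter(dataset.values()))
print(first_split[0])
```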
Anthology ID:
2025.trustnlp-main.32
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galstyan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
501–510
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.32/
Cite (ACL):
Vipula Rawte, Aryan Mishra, Amit Sheth, and Amitava Das. 2025. Defining and Quantifying Visual Hallucinations in Vision-Language Models. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 501–510, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Defining and Quantifying Visual Hallucinations in Vision-Language Models (Rawte et al., TrustNLP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.32.pdf