Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs

Anirudh Phukan, Divyansh Divyansh, Harshit Kumar Morj, Vaishnavi Vaishnavi, Apoorv Saxena, Koustava Goswami


Abstract
The rapid development of Large Multimodal Models (LMMs) has significantly advanced multimodal understanding by harnessing the language abilities of Large Language Models (LLMs) and integrating modality-specific encoders. However, LMMs are plagued by hallucinations that limit their reliability and adoption. While traditional methods to detect and mitigate these hallucinations often involve costly training or rely heavily on external models, recent approaches utilizing internal model features present a promising alternative. In this paper, we critically assess the limitations of the state-of-the-art training-free technique, the logit lens, in handling generalized visual hallucinations. We introduce ContextualLens, a refined method that leverages contextual token embeddings from middle layers of LMMs. This approach significantly improves hallucination detection and grounding across diverse categories, including actions and OCR, while also excelling in tasks requiring contextual understanding, such as spatial relations and attribute comparison. Our novel grounding technique yields highly precise bounding boxes, facilitating a transition from Zero-Shot Object Segmentation to Grounded Visual Question Answering. Our contributions pave the way for more reliable and interpretable multimodal models.
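
To make the abstract's contrast concrete, here is a minimal PyTorch sketch of the two ideas it compares. This is not the authors' released implementation; the tensor shapes, the variable names (`unembed`, `patch_states`), and the cosine-similarity scoring are illustrative assumptions. The logit lens decodes an intermediate hidden state through the model's output embedding matrix, whereas a contextual-embedding approach compares a generated token's middle-layer hidden state directly against the image-patch hidden states from the same layer.

```python
import torch
import torch.nn.functional as F

def logit_lens(hidden: torch.Tensor, unembed: torch.Tensor) -> torch.Tensor:
    """Logit lens: project one token's intermediate hidden state
    (d_model,) through the unembedding matrix (vocab, d_model)
    to obtain logits (vocab,) over the vocabulary."""
    return unembed @ hidden

def patch_similarity(token_state: torch.Tensor,
                     patch_states: torch.Tensor) -> torch.Tensor:
    """Contextual-embedding alternative (illustrative): cosine similarity
    between a generated token's hidden state (d_model,) and every
    image-patch hidden state (n_patches, d_model) from the same layer.
    A weak best match suggests the token lacks visual support (a
    hallucination candidate); the strongest patches localize it."""
    return F.cosine_similarity(patch_states, token_state.unsqueeze(0), dim=-1)

# Toy usage with random tensors standing in for real LMM activations.
d_model, vocab, n_patches = 64, 1000, 576  # 576 = 24x24 patch grid (assumed)
hidden = torch.randn(d_model)
unembed = torch.randn(vocab, d_model)
patches = torch.randn(n_patches, d_model)

print(logit_lens(hidden, unembed).topk(3).indices)  # logit-lens top tokens
sim = patch_similarity(hidden, patches)
print(sim.max().item(), sim.argmax().item())        # support strength, best patch
```

Per the abstract, the payoff of the second route is that middle-layer contextual embeddings detect and ground hallucinations across categories (actions, OCR, spatial relations, attribute comparison) where the logit lens falls short, and the resulting patch-level evidence can be aggregated into precise bounding boxes.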
Anthology ID: 2025.naacl-long.488
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 9661–9675
URL: https://preview.aclanthology.org/Author-page-Marten-During-lu/2025.naacl-long.488/
Cite (ACL): Anirudh Phukan, Divyansh Divyansh, Harshit Kumar Morj, Vaishnavi Vaishnavi, Apoorv Saxena, and Koustava Goswami. 2025. Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9661–9675, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs (Phukan et al., NAACL 2025)
PDF: https://preview.aclanthology.org/Author-page-Marten-During-lu/2025.naacl-long.488.pdf