Mitigating Hallucinations in Multi-modal Large Language Models via Image Token Attention-Guided Decoding

Xinhao Xu; Hui Chen; Mengyao Lyu; Sicheng Zhao; Yizhe Xiong; Zijia Lin; Jungong Han; Guiguang Ding

Abstract

Multi-modal large language models (MLLMs) integrate the inherent text generation capabilities of large language models with an understanding of other modalities, promising wide applications in open-ended tasks. Despite their success, they often generate plausible but incorrect content. This phenomenon, known as hallucination, significantly hinders their practical deployment. In this paper, we examine the intrinsic characteristics of hallucination from the perspective of the interaction between input and output tokens. We find that hallucination typically coincides with reduced attention from output tokens to image tokens. Based on this observation, we introduce image Token attention-guided Decoding (iTaD), a plug-and-play method that leverages MLLMs’ internal representations to mitigate their hallucinations. We first define an image token attention vector to measure how the attention of output tokens to image tokens differs across layers. Based on this vector, we design a novel layer selection strategy and conduct inter-layer contrastive decoding to highlight the progression in image understanding, thereby exploiting attention to image tokens to mitigate hallucinations. Extensive experiments demonstrate iTaD’s effectiveness across different MLLMs and benchmarks.

Anthology ID: 2025.naacl-long.75
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 1571–1590
URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.75/
Cite (ACL): Xinhao Xu, Hui Chen, Mengyao Lyu, Sicheng Zhao, Yizhe Xiong, Zijia Lin, Jungong Han, and Guiguang Ding. 2025. Mitigating Hallucinations in Multi-modal Large Language Models via Image Token Attention-Guided Decoding. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1571–1590, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Mitigating Hallucinations in Multi-modal Large Language Models via Image Token Attention-Guided Decoding (Xu et al., NAACL 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.75.pdf
