Qiyan Zhao

2026

Multimodal large language models (MLLMs) are increasingly deployed in Web-scale applications (such as image search, social media captioning, and e-commerce product description generation) where factual consistency is critical for user trust and content reliability. However, we observe that MLLMs frequently hallucinate in these settings due to two related phenomena: massive activations and positional information decay. The former refers to the tendency of attention mechanisms to concentrate on a small set of tokens with extreme activation values in the query and key projections; these tokens play an indispensable role in contextual understanding. In our investigation, we find that perturbing these tokens leads to significant performance drops, underscoring their importance. Positional information decay arises from the widely used rotary position encoding strategy: attention to early visual tokens diminishes as generation proceeds, especially in long-sequence generation tasks such as image captioning. To address these challenges, we propose TokenTruth, a token-level intervention strategy that dynamically suppresses irrelevant visual tokens while preserving key contextual signals. Our method is grounded in an in-depth analysis of massive activations and attention sink behaviors, and introduces a targeted token penalty mechanism that reallocates attention more faithfully toward informative visual regions. Extensive experiments demonstrate that TokenTruth significantly improves factual consistency across various MLLMs on standard image understanding benchmarks.

Fixing Semantic Blind Spots in Anchor Tokens of dMLLMs
Ruixuan Xu | Jiexi Xu | Qiyan Zhao | Xiaofeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026

Recent advances in diffusion-based Multimodal Large Language Models (dMLLMs) offer a compelling alternative to autoregressive counterparts; however, they remain prone to hallucinations. Through information flow analysis on LLaDA-V, we identify two intertwined factors contributing to this issue. First, although the special tokens serve as semantic anchors for aggregating visual information, they simultaneously induce severe attention sinks, excessively consuming the model’s attention budget. Second, the long-range decay inherent in Rotary Position Embedding (RoPE) leads to semantic blind spots, preventing these anchors from uniformly perceiving the entire visual input. Accordingly, our objective is to moderately alleviate the attention sink effect on semantic anchors while enhancing their ability to aggregate global visual information, thereby eliminating semantic blind spots. To this end, we propose Extrinsic Distance-Aware Regularization (EDAR), a training-free decoding strategy that augments the attention key space with a static, distance-aware matrix. This matrix jointly redistributes excessive attention away from anchors and injects absolute positional bias to ensure uniform visual coverage. Experiments on LLaDA-V demonstrate that EDAR effectively eliminates semantic blind spots and achieves state-of-the-art performance on both hallucination-specific and general multimodal benchmarks.

Co-authors

Xiaosong Yuan 1

Yuanchao Zhu 1

Venues

Findings 2
