Jiawei Cao

Other people with similar names: Jiawei Cao

Unverified author pages with similar names: Jiawei Cao


2026

Multimodal large language models (MLLMs) are increasingly deployed in Web-scale applications—such as image search, social media captioning, and e-commerce product description generation—where factual consistency is critical for user trust and content reliability. However, we observe that MLLMs frequently hallucinate in these settings due to two relevant phenomena: the massive activation phenomenon and positional information decay. The former refers to the tendency of attention mechanisms to concentrate on a small set of tokens with extreme activation values in query and key projections, which play indispensable roles in contextual understanding. In our investigation, we discover that perturbing these tokens leads to significant performance drops, highlighting their utmost importance. As for positional information decay, it occurs due to the common rotary position encoding strategy, where the attention to early visual tokens diminishes over time, especially in long-sequence generation tasks, such as image caption. To address these challenges, we propose TokenTruth, a token-level intervention strategy that dynamically suppresses irrelevant visual tokens while preserving key contextual signals. Our method is grounded in an in-depth analysis of massive activations and attention sink behaviors, and introduces a targeted token penalty mechanism that reallocates attention more faithfully toward informative visual regions. Extensive experiments demonstrate that TokenTruth significantly improves factual consistency across various MLLMs on standard image understanding benchmarks.

2025

Multimodal large language models (MLLMs) demonstrate excellent abilities for understanding visual information, while the hallucination remains. Albeit image tokens constitute the majority of the MLLMs input, the relation between image tokens and hallucinations is still unexplored. In this paper, we analyze the attention score distribution of image tokens across layers and attention heads in models, revealing an intriguing but common phenomenon: most hallucinations are closely linked to the attention sink patterns of image tokens attention matrix, where shallow layers exhibit dense sinks and deep layers exhibit the sparse. We further explore the attention heads of different layers, finding: heads with high-density attention sink of the image part act positively in mitigating hallucinations. Inspired by these findings, we propose a training-free approach called Enhancing Vision Attention Sinks (EVAS) to facilitate the convergence of the image token attention sink within shallow layers. Specifically, EVAS identifies the attention heads that emerge as the densest visual sink in shallow layers and extracts its attention matrix, which is then broadcast to other heads of the same layer, thereby strengthing the layer’s focus on the image itself. Extensive empirical results of various MLLMs illustrate the superior performance of the proposed EVAS, demonstrating its effectiveness and generality.