Jiawei Cao
2026
TokenPenalty: Alleviating Attention Sinks and Positional Decay in LVLMs
Xiaofeng Zhang | Yuanchao Zhu | Qiyan Zhao | Xiaosong Yuan | Jiawei Cao | Xuhang Chen
Findings of the Association for Computational Linguistics: ACL 2026
Multimodal large language models (MLLMs) are increasingly deployed in Web-scale applications (such as image search, social media captioning, and e-commerce product description generation), where factual consistency is critical for user trust and content reliability. However, we observe that MLLMs frequently hallucinate in these settings due to two related phenomena: massive activations and positional information decay. The former refers to the tendency of attention mechanisms to concentrate on a small set of tokens with extreme activation values in the query and key projections; these tokens play an indispensable role in contextual understanding. In our investigation, we find that perturbing these tokens leads to significant performance drops, underscoring their importance. Positional information decay arises from the widely used rotary position encoding strategy, under which attention to early visual tokens diminishes as generation proceeds, especially in long-sequence generation tasks such as image captioning. To address these challenges, we propose TokenTruth, a token-level intervention strategy that dynamically suppresses irrelevant visual tokens while preserving key contextual signals. Our method is grounded in an in-depth analysis of massive activations and attention sink behaviors, and introduces a targeted token penalty mechanism that reallocates attention more faithfully toward informative visual regions. Extensive experiments demonstrate that TokenTruth significantly improves factual consistency across various MLLMs on standard image understanding benchmarks.
2025
Shallow Focus, Deep Fixes: Enhancing Shallow Layers Vision Attention Sinks to Alleviate Hallucination in LVLMs
Xiaofeng Zhang | Yihao Quan | Chen Shen | Chaochen Gu | Xiaosong Yuan | Shaotian Yan | Jiawei Cao | Hao Cheng | Kaijie Wu | Jieping Ye
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal large language models (MLLMs) demonstrate excellent abilities for understanding visual information, yet hallucination remains a problem. Although image tokens constitute the majority of MLLM input, the relation between image tokens and hallucinations is still unexplored. In this paper, we analyze the attention score distribution of image tokens across layers and attention heads, revealing an intriguing but common phenomenon: most hallucinations are closely linked to the attention sink patterns of the image-token attention matrix, where shallow layers exhibit dense sinks and deep layers exhibit sparse ones. We further examine the attention heads of different layers and find that heads with a high-density attention sink over the image tokens help mitigate hallucinations. Inspired by these findings, we propose a training-free approach called Enhancing Vision Attention Sinks (EVAS) to facilitate the convergence of the image token attention sink within shallow layers. Specifically, EVAS identifies the attention head that emerges as the densest visual sink in a shallow layer and extracts its attention matrix, which is then broadcast to the other heads of the same layer, thereby strengthening the layer’s focus on the image itself. Extensive empirical results on various MLLMs demonstrate the superior performance of the proposed EVAS and confirm its effectiveness and generality.
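The head-selection and broadcast step described in the EVAS abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how sink density is measured or which layers count as shallow, so the `sink_density` proxy (the fraction of image-token columns whose mean received attention exceeds a threshold), the `threshold` value, and all function names below are assumptions made purely for illustration.

```python
from typing import List

# One head's attention weights, indexed as attn[query][key].
Matrix = List[List[float]]


def sink_density(attn: Matrix, image_cols: range, threshold: float) -> float:
    """Assumed proxy for visual sink density: the fraction of image-token
    columns whose mean attention (averaged over queries) exceeds threshold."""
    n_queries = len(attn)
    dense = 0
    for col in image_cols:
        mean_attn = sum(row[col] for row in attn) / n_queries
        if mean_attn > threshold:
            dense += 1
    return dense / len(image_cols)


def broadcast_densest_head(
    layer_attn: List[Matrix], image_cols: range, threshold: float = 0.2
) -> List[Matrix]:
    """Find the head with the highest (assumed) sink density in one layer
    and return a copy of its attention matrix for every head in that layer."""
    densities = [sink_density(head, image_cols, threshold) for head in layer_attn]
    best = max(range(len(layer_attn)), key=densities.__getitem__)
    source = layer_attn[best]
    # Copy row by row so the returned heads do not share list objects.
    return [[row[:] for row in source] for _ in layer_attn]


# Toy layer: 2 heads, 2 query positions, 3 key positions.
# Keys 0 and 1 stand in for image tokens; key 2 is a text token.
head0 = [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]  # attends mostly to text
head1 = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]  # attends mostly to the image
layer = [head0, head1]
patched = broadcast_densest_head(layer, image_cols=range(0, 2))
```

In this toy layer, `head1` has the denser image sink under the assumed proxy, so both heads in `patched` carry its attention matrix. In a real model the same selection would apply only to the layers designated as shallow, with the broadcast matrix substituted during the forward pass.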