Haochen Zhao

2026

The Halo Effect and Language Takeover: Spatiotemporal Attention Decay Explains Vision-Language Model Failures in Simple Visual Counting
Haochen Zhao | Sujian Li
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)

Despite their remarkable capabilities in complex multimodal reasoning, Vision-Language Models (VLMs) exhibit a perplexing inability to perform elementary visual counting tasks reliably. Existing hypotheses, often centering on input resolution or patch tokenization, fail to fully explain the stochastic nature of these errors, particularly in multi-digit generation. In this work, we investigate the internal decision-making dynamics of VLMs (e.g., Qwen3-VL, Gemma3) through the lens of attention mechanisms. By leveraging a controlled synthetic dataset and introducing new metrics for Visual Sparsity and Entropy, we identify a novel phenomenon: Spatiotemporal Attention Decay. Our analysis reveals two distinct failure modes. Spatially, models exhibit a Halo Effect, where attention focuses on the peripheral convex hull of object clusters rather than penetrating the geometric centers of individual instances. Temporally, we observe Language Takeover: during autoregressive decoding, visual grounding decays rapidly after the initial token. Quantitative analysis confirms that as attention sparsity drops and entropy rises, the generation of subsequent digits degenerates from visual perception into hallucination driven by language priors. These findings suggest that counting failures stem from the model’s inability to maintain spatiotemporal focus, highlighting the need for mechanisms that enforce persistent visual grounding.

Co-authors

Sujian Li (李素建) (1)

Venues

TrustNLP (1)
WS (1)
