Yubo Jiang
2026
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
Yubo Jiang | Xin Yang | Abudukelimu Wuerkaixi | Zheming Yuan | Xuxin Cheng | Cao Liu | Ke Zeng | Fengying Xie | Zhiguo Jiang | Haopeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Large vision–language models (LVLMs) excel at multimodal reasoning but still suffer from object-existence hallucinations when multi-step deliberation decouples from visual evidence. Think-with-Images (TwI) attempts to counter this by generating auxiliary observations (e.g., zoomed crops or highlighted views), yet it is not reliably beneficial. We identify two coupled failure modes: (1) a granularity–context trade-off of common operators, where zoom-in improves local detail but breaks global relations, while highlighting preserves topology but lacks fine evidence; and (2) an over-trust issue in tool-guided region proposals, where mislocalized evidence can dominate reasoning and even underperform standard prompting. We propose Active-Look, a training-free, plug-and-play TwI framework that allocates visual computation by uncertainty. Active-Look runs two heterogeneous grounding experts in parallel and uses their disagreement as a proxy for uncertainty, spending the budget only to verify disputed regions. It further mitigates the operator trade-off with conflict-aware hybrid rendering: highlighting retains global context, while selective zoom-in performs local verification. Experiments on hallucination-focused and general benchmarks (POPE, MME, and CHAIR) across multiple LVLM backbones show consistent gains.
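The disagreement-driven allocation described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how disagreement is measured or how the budget is spent, so the sketch assumes that each grounding expert returns bounding boxes, that agreement is measured by intersection-over-union (IoU), and that the zoom budget goes to the largest disputed regions first. The names `Box`, `iou`, `plan_rendering`, `agree_iou`, and `zoom_budget` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region proposal (hypothetical representation)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # An empty or inverted box has zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 if they do not overlap."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def plan_rendering(expert_a, expert_b, agree_iou=0.5, zoom_budget=2):
    """Split proposals into agreed regions (highlight only) and
    disputed regions (candidates for zoom-in verification).

    Assumption: two proposals "agree" when their IoU >= agree_iou.
    """
    highlight, disputed = [], []
    matched_b = set()
    for a in expert_a:
        # Greedily match each proposal from expert A to the best
        # still-unmatched proposal from expert B.
        best_j, best_score = -1, 0.0
        for j, b in enumerate(expert_b):
            if j in matched_b:
                continue
            score = iou(a, b)
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= agree_iou:
            matched_b.add(best_j)
            highlight.append(a)   # experts agree: keep global context only
        else:
            disputed.append(a)    # experts disagree: treat as uncertain
    # Proposals from expert B that nothing matched are also disputed.
    disputed.extend(b for j, b in enumerate(expert_b) if j not in matched_b)
    # Assumed heuristic: spend the zoom budget on the largest disputes.
    disputed.sort(key=Box.area, reverse=True)
    return {"highlight": highlight, "zoom": disputed[:zoom_budget]}


if __name__ == "__main__":
    a = [Box(0, 0, 10, 10), Box(20, 20, 30, 30)]
    b = [Box(1, 1, 10, 10), Box(50, 50, 60, 60)]
    print(plan_rendering(a, b))
```

In this sketch, regions on which both experts agree are only highlighted, which keeps the full image and its global relations intact, while zoom-in crops are reserved for the disputed regions and capped by `zoom_budget`. A real system would replace the box lists with the outputs of the two grounding experts and pass the rendered views back to the LVLM.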