Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images
Shijie Zhou, Jihyung Kil, Ming Li, Jiuxiang Gu, Curtis Wigington, Rajiv Jain, Changyou Chen, Ruiyi Zhang
Abstract
Visual text grounding provides interpretable evidence for document question answering. Due to the complex layouts and mixed visual-text contents in text-rich images, effective visual text grounding requires strong visual and spatial reasoning to localize multiple referenced regions. Existing multimodal large language model (MLLM) approaches often struggle to align query tokens with visual–text patches, heavily relying on lengthy OCR inputs. To tackle this problem, we propose Doc-AGround, an OCR-free approach that leverages the MLLM’s inherent multi-head attention for multi-patch grounding. Doc-AGround extracts a patch-wise attention map as the grounding prediction. Concurrently, it introduces an effective multi-head weighting mechanism to amplify the attention heads’ intrinsic role in connecting vision and text. Empirical results of Doc-AGround show state-of-the-art performance on challenging document grounding benchmarks, demonstrating the effectiveness of the proposed attention-based grounding design.
- Anthology ID:
- 2026.findings-acl.16
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 352–370
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.16/
- Cite (ACL):
- Shijie Zhou, Jihyung Kil, Ming Li, Jiuxiang Gu, Curtis Wigington, Rajiv Jain, Changyou Chen, and Ruiyi Zhang. 2026. Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images. In Findings of the Association for Computational Linguistics: ACL 2026, pages 352–370, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images (Zhou et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.16.pdf
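
The abstract describes extracting a patch-wise attention map and weighting attention heads to produce a grounding prediction. The sketch below is one illustrative reading of that idea and is not the authors' implementation: the function name `ground_patches`, the input shapes, the mean over query tokens, the min-max normalization, and the fixed threshold are all assumptions made here, and the head weights are simply supplied by the caller rather than derived by the paper's weighting mechanism.

```python
import numpy as np

def ground_patches(attn, head_weights, grid_hw, threshold=0.5):
    """Fuse per-head attention into a patch heatmap (hypothetical helper).

    attn:         array of shape (heads, query_tokens, patches), assumed to hold
                  attention from query tokens to image patches for one layer.
    head_weights: array of shape (heads,), non-negative importance per head
                  (assumed given; how to obtain them is not covered here).
    grid_hw:      (rows, cols) of the patch grid, with rows * cols == patches.
    """
    attn = np.asarray(attn, dtype=float)
    w = np.asarray(head_weights, dtype=float)
    w = w / w.sum()                      # normalize head weights to sum to 1

    per_head = attn.mean(axis=1)         # average over query tokens -> (heads, patches)
    fused = np.tensordot(w, per_head, axes=1)  # weighted sum over heads -> (patches,)

    # Min-max normalize so the threshold is comparable across images.
    lo, hi = fused.min(), fused.max()
    norm = (fused - lo) / (hi - lo) if hi > lo else np.zeros_like(fused)

    heat = norm.reshape(grid_hw)         # patch-wise heatmap on the image grid
    mask = heat >= threshold             # patches selected as grounded regions
    return heat, mask

# Toy usage: 2 heads, 1 query token, a 2x2 patch grid.
toy_attn = np.array([[[1.0, 0.0, 0.0, 0.0]],
                     [[0.0, 0.0, 0.0, 1.0]]])
heat, mask = ground_patches(toy_attn, head_weights=[3.0, 1.0], grid_hw=(2, 2))
```

Because the mask can select several disconnected patches at once, a scheme of this kind naturally returns multiple regions per query, which matches the multi-patch setting the abstract refers to.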