Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images

Shijie Zhou; Jihyung Kil; Ming Li; Jiuxiang Gu; Curtis Wigington; Rajiv Jain; Changyou Chen; Ruiyi Zhang

Abstract

Visual text grounding provides interpretable evidence for document question answering. Because text-rich images combine complex layouts with mixed visual and textual content, effective visual text grounding requires strong visual and spatial reasoning to localize multiple referenced regions. Existing multimodal large language model (MLLM) approaches often struggle to align query tokens with visual–text patches and rely heavily on lengthy OCR inputs. To tackle this problem, we propose Doc-AGround, an OCR-free approach that leverages the MLLM’s inherent multi-head attention for multi-patch grounding. Doc-AGround extracts a patch-wise attention map as the grounding prediction. It also introduces a multi-head weighting mechanism to amplify the attention heads’ intrinsic role in connecting vision and text. Empirically, Doc-AGround achieves state-of-the-art performance on challenging document grounding benchmarks, demonstrating the effectiveness of the proposed attention-based grounding design.
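The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea of turning multi-head attention into a patch-wise grounding map. It assumes that the attention from query tokens to image patches is already available per head, that head weights are supplied externally (how Doc-AGround derives them is not stated here), and that patches are selected by a simple threshold. The names `patch_attention_map`, `select_patches`, `head_weights`, and `threshold` are hypothetical and do not come from the paper.

```python
def patch_attention_map(attn, head_weights):
    """Combine per-head attention into one score per image patch.

    attn[h][q][p] is the attention from query token q to patch p in head h.
    head_weights[h] is a non-negative importance weight for head h
    (assumed to be given; this sketch does not compute it).
    """
    num_patches = len(attn[0][0])
    total_weight = sum(head_weights)
    scores = [0.0] * num_patches
    for head, weight in zip(attn, head_weights):
        normalized_weight = weight / total_weight
        num_queries = len(head)
        for p in range(num_patches):
            # Average the attention on patch p over all query tokens.
            mean_attention = sum(row[p] for row in head) / num_queries
            scores[p] += normalized_weight * mean_attention
    return scores


def select_patches(scores, threshold):
    """Return the indices of patches whose score reaches the threshold."""
    return [p for p, s in enumerate(scores) if s >= threshold]


# Toy input: 2 heads, 2 query tokens, 4 patches.
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]],          # a focused head
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # a diffuse head
]
head_weights = [3.0, 1.0]  # hypothetical weights favoring the focused head

scores = patch_attention_map(attn, head_weights)
grounded = select_patches(scores, threshold=0.2)
print(scores, grounded)
```

In this toy case the focused head dominates the weighted map, so more than one patch can pass the threshold, which is the multi-patch behavior the abstract describes at a high level.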

Anthology ID: 2026.findings-acl.16
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 352–370
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.16/
Cite (ACL): Shijie Zhou, Jihyung Kil, Ming Li, Jiuxiang Gu, Curtis Wigington, Rajiv Jain, Changyou Chen, and Ruiyi Zhang. 2026. Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images. In Findings of the Association for Computational Linguistics: ACL 2026, pages 352–370, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images (Zhou et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.16.pdf
Checklist: 2026.findings-acl.16.checklist.pdf
