Curtis Wigington
2026
Unveiling Inherent Visual Grounding in Multimodal LLMs for Text-Rich Images
Shijie Zhou | Jihyung Kil | Ming Li | Jiuxiang Gu | Curtis Wigington | Rajiv Jain | Changyou Chen | Ruiyi Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Visual text grounding provides interpretable evidence for document question answering. Due to the complex layouts and mixed visual-text contents in text-rich images, effective visual text grounding requires strong visual and spatial reasoning to localize multiple referenced regions. Existing multimodal large language model (MLLM) approaches often struggle to align query tokens with visual–text patches, heavily relying on lengthy OCR inputs. To tackle this problem, we propose Doc-AGround, an OCR-free approach that leverages the MLLM’s inherent multi-head attention for multi-patch grounding. Doc-AGround extracts a patch-wise attention map as the grounding prediction. Concurrently, it introduces an effective multi-head weighting mechanism to amplify the attention heads’ intrinsic role in connecting vision and text. Empirical results of Doc-AGround show state-of-the-art performance on challenging document grounding benchmarks, demonstrating the effectiveness of the proposed attention-based grounding design.
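The abstract describes combining per-head attention into a single patch-wise map, with heads weighted by how much they contribute to connecting vision and text. The sketch below is only an illustration of that general idea, not the Doc-AGround method: the function names, the normalized weighted sum, and the fixed threshold are all assumptions, since the abstract does not say how the weights are computed or how patches are selected.

```python
from typing import List


def weighted_patch_attention(
    head_maps: List[List[float]], head_weights: List[float]
) -> List[float]:
    """Combine per-head patch attention into one map.

    head_maps[h][p] is the attention that head h assigns to image patch p.
    head_weights[h] is a non-negative importance score for head h
    (hypothetical; how such scores are obtained is not given in the abstract).
    """
    if not head_maps or len(head_maps) != len(head_weights):
        raise ValueError("need one weight per attention head")
    total = sum(head_weights)
    if total <= 0:
        raise ValueError("head weights must sum to a positive value")

    num_patches = len(head_maps[0])
    combined = [0.0] * num_patches
    for head_map, weight in zip(head_maps, head_weights):
        if len(head_map) != num_patches:
            raise ValueError("all heads must cover the same patches")
        for patch, attention in enumerate(head_map):
            # Normalize weights so the result stays on the attention scale.
            combined[patch] += (weight / total) * attention
    return combined


def select_patches(combined: List[float], threshold: float) -> List[int]:
    """Return indices of patches whose combined score reaches the threshold."""
    return [patch for patch, score in enumerate(combined) if score >= threshold]


if __name__ == "__main__":
    maps = [[0.1, 0.9, 0.0], [0.5, 0.5, 0.0]]
    scores = weighted_patch_attention(maps, head_weights=[3.0, 1.0])
    print(select_patches(scores, threshold=0.5))
```

Returning a list of patch indices rather than a single box reflects the multi-patch grounding setting mentioned in the abstract, where an answer may refer to several regions.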
2022
TELIN: Table Entity LINker for Extracting Leaderboards from Machine Learning Publications
Sean Yang | Chris Tensmeyer | Curtis Wigington
Proceedings of the First Workshop on Information Extraction from Scientific Publications
Tracking state-of-the-art (SOTA) results in machine learning studies is challenging due to high publication volume. Existing methods for creating leaderboards in scientific documents require significant human supervision or rely on scarcely available LaTeX source files. We propose Table Entity LINker (TELIN), a framework which extracts (task, model, dataset, metric) quadruples from collections of scientific publications in PDF format. TELIN identifies scientific named entities, constructs a knowledge base, and leverages human feedback to iteratively refine automatic extractions. TELIN identifies and prioritizes uncertain and impactful entities for human review to create a cascade effect for leaderboard completion. We show that TELIN is competitive with the SOTA but requires much less human annotation.
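The abstract says TELIN prioritizes uncertain and impactful entities for human review. One way to picture that ordering is sketched below. This is not TELIN's actual scoring: the `Candidate` fields, the use of mention count as a stand-in for impact, and the product `(1 - confidence) * mentions` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """An automatically extracted entity awaiting review (hypothetical schema)."""
    name: str
    confidence: float  # extractor confidence in [0, 1]
    mentions: int      # number of table cells or papers that refer to it


def review_priority(candidate: Candidate) -> float:
    """Score is high when the entity is both uncertain and widely referenced."""
    return (1.0 - candidate.confidence) * candidate.mentions


def rank_for_review(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates so the most valuable annotation comes first."""
    return sorted(candidates, key=review_priority, reverse=True)


if __name__ == "__main__":
    queue = rank_for_review([
        Candidate("dataset-A", confidence=0.9, mentions=10),
        Candidate("metric-B", confidence=0.5, mentions=4),
        Candidate("model-C", confidence=0.2, mentions=1),
    ])
    for candidate in queue:
        print(candidate.name, round(review_priority(candidate), 2))
```

Under this assumed scoring, resolving a frequently mentioned entity first lets a single human decision propagate to many extracted quadruples, which is one reading of the cascade effect the abstract mentions.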