VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim


Abstract
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding is a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. Drawing on the theory of graphical perception, we formalize four perceptual tasks, including position and length. Building on this foundation, we introduce decomposition-of-thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm that the perception-logic separation strategy generalizes to visual question answering more broadly. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
Anthology ID:
2026.findings-eacl.30
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
610–640
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.30/
Cite (ACL):
Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, and Jihie Kim. 2026. VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought. In Findings of the Association for Computational Linguistics: EACL 2026, pages 610–640, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought (Lee et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.30.pdf
Checklist:
 2026.findings-eacl.30.checklist.pdf