VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Byeonggeuk Lim, Kyeonghyun Kim, Jungmin Yun, Youngbin Kim


Abstract
The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model’s logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning–Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements across all evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
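The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the data structures, function names, and stubbed outputs are assumptions for exposition, not the authors' released code, and the detection, OCR, and GPT-4o stages are replaced with fixed placeholder outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical structures illustrating the VG-CoT three-stage pipeline.
# All names and outputs here are illustrative assumptions.

@dataclass
class Evidence:
    label: str                       # detected object class or OCR text
    box: Tuple[int, int, int, int]   # bounding box (x1, y1, x2, y2)

@dataclass
class GroundedStep:
    rationale: str                   # one reasoning step
    evidence: List[Evidence]         # image regions this step is grounded in

def extract_evidence(image) -> List[Evidence]:
    """Stage 1: object detection + OCR (stubbed with fixed outputs)."""
    return [Evidence("stop sign", (10, 10, 60, 60)),
            Evidence("STOP", (20, 25, 50, 40))]

def generate_reasoning(question: str,
                       evidence: List[Evidence]) -> List[GroundedStep]:
    """Stage 2: step-by-step grounded rationale (stub for a GPT-4o call)."""
    return [GroundedStep(f"The highlighted region shows a {ev.label}.", [ev])
            for ev in evidence]

def refine_grounding(steps: List[GroundedStep]) -> List[GroundedStep]:
    """Stage 3: rationale-driven refinement; keep only grounded steps
    (a stand-in for re-verifying boxes with an open-set detector)."""
    return [s for s in steps if s.evidence]

def build_vg_cot(image, question: str) -> List[GroundedStep]:
    """Compose the three stages into one annotation pass for an image."""
    return refine_grounding(generate_reasoning(question,
                                               extract_evidence(image)))
```

In this sketch every reasoning step that survives the pipeline carries at least one bounding box, which is the property the VG-CoT benchmark's Reasoning–Answer Alignment dimension is designed to probe.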
Anthology ID:
2026.lrec-main.867
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
11108–11118
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.867/
Cite (ACL):
Byeonggeuk Lim, Kyeonghyun Kim, Jungmin Yun, and Youngbin Kim. 2026. VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 11108–11118, Palma de Mallorca, Spain.
Cite (Informal):
VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought (Lim et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.867.pdf