Simple-VGC: Enhancing Visual Grounding in Multimodal Reasoning via Adaptive Tool Composition

Ye Wang; Qianglong Chen; Siyuan Wang (王思远); Zejun Li (李泽君); Shijie Guo; Zhirui Zhang; Zhongyu Wei (魏忠钰)

Abstract

Multimodal Large Language Models (MLLMs) have achieved strong performance on vision-language tasks, yet often fail to preserve and effectively leverage visual evidence throughout generation. We identify three fundamental types of visual grounding failures: Long-Context Grounding Error, where visual information gradually decays over long sequences; Fine-Grained Grounding Error, where low-resolution or degraded inputs hinder the recovery of detailed visual information; and Regional Grounding Error, where spatially diffuse attention weakens region-level vision-language alignment. To address these issues, we propose a tool-augmented reasoning framework with three targeted compensation strategies: reuse, which re-injects the original image to mitigate visual forgetting; focus_area, which constrains attention to task-relevant regions; and zoom_in, which enhances visual resolution for fine-grained perception. We further construct the TWI-Tools-146K dataset and develop Simple-VGC, a tool-augmented MLLM that interleaves visual and textual tokens. Extensive experiments show that each tool yields targeted improvements for its corresponding grounding error, while their combination produces synergistic gains in visual reasoning. Beyond performance, our analysis provides mechanistic insights into how tool-based interventions improve visual grounding, pointing toward more reliable multimodal reasoning.
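The three compensation strategies named in the abstract can be pictured as simple image operations. The sketch below is illustrative only and is not the authors' implementation: the function signatures, the bounding-box convention, the use of cropping for focus_area, and nearest-neighbour upsampling for zoom_in are all assumptions made for this example, and a nested list of integers stands in for a real image.

```python
from typing import List, Tuple

# Assumption: a grayscale image as rows of pixel values (a stand-in for a real image type).
Image = List[List[int]]
# Assumption: boxes are (top, left, bottom, right) with bottom/right exclusive.
Box = Tuple[int, int, int, int]


def reuse(original: Image) -> Image:
    """Return a fresh copy of the original image so it can be re-injected
    later in the context (hypothetical reading of the `reuse` tool)."""
    return [row[:] for row in original]


def focus_area(image: Image, box: Box) -> Image:
    """Restrict the view to a task-relevant region by cropping
    (hypothetical reading of the `focus_area` tool)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def zoom_in(image: Image, box: Box, scale: int = 2) -> Image:
    """Crop a region and enlarge it by repeating each pixel `scale` times
    in both directions (a simple stand-in for resolution enhancement)."""
    region = focus_area(image, box)
    zoomed: Image = []
    for row in region:
        wide = [px for px in row for _ in range(scale)]
        zoomed.extend(wide[:] for _ in range(scale))
    return zoomed
```

In a tool-augmented model, the output of such a call would be encoded as visual tokens and interleaved with the text generated so far; how Simple-VGC selects and composes the tools is described in the paper itself, not in this sketch.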

Anthology ID: 2026.acl-long.223
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 4903–4929
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.223/
Cite (ACL): Ye Wang, Qianglong Chen, Siyuan Wang, Zejun Li, Shijie Guo, Zhirui Zhang, and Zhongyu Wei. 2026. Simple-VGC: Enhancing Visual Grounding in Multimodal Reasoning via Adaptive Tool Composition. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4903–4929, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Simple-VGC: Enhancing Visual Grounding in Multimodal Reasoning via Adaptive Tool Composition (Wang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.223.pdf
Checklist: 2026.acl-long.223.checklist.pdf
