ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Mengjie Deng, Guanting Dong, Zhicheng Dou


Abstract
Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment the MLLM with local perception through the integration of external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains: VQA 2.0, ScienceQA, MAT-Search, and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets. Our code is available at https://github.com/dengmengjie/ToolScope.
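The three-component flow named in the abstract (Global Navigator, Agentic Executor, Response Synthesizer) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name, the `Trace` type, the fixed plan, and the stub tools below are hypothetical stand-ins chosen for illustration, and a real system would presumably call an MLLM where this sketch uses hard-coded logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: takes a query string, returns an observation string.
Tool = Callable[[str], str]


@dataclass
class Trace:
    """Record of the plan and the observations gathered while executing it."""
    plan: List[Tuple[str, str]]
    observations: List[str] = field(default_factory=list)


def global_navigator(question: str) -> List[Tuple[str, str]]:
    # Assumption: the high-level plan is a list of (tool_name, query) steps.
    # The plan is fixed here purely for illustration.
    return [("perceive", question), ("search", question), ("code", "2 + 3")]


def agentic_executor(plan: List[Tuple[str, str]], tools: Dict[str, Tool]) -> Trace:
    # Walk the plan step by step, calling the matching tool and
    # recording each observation; unknown tools are noted, not fatal.
    trace = Trace(plan=plan)
    for tool_name, query in plan:
        tool = tools.get(tool_name)
        if tool is None:
            trace.observations.append(f"[{tool_name}] unavailable")
            continue
        trace.observations.append(f"[{tool_name}] {tool(query)}")
    return trace


def response_synthesizer(question: str, trace: Trace) -> str:
    # Consolidate the collected observations into one readable answer string.
    evidence = "; ".join(trace.observations)
    return f"Q: {question} | Evidence: {evidence}"


def run_pipeline(question: str, tools: Dict[str, Tool]) -> str:
    plan = global_navigator(question)
    trace = agentic_executor(plan, tools)
    return response_synthesizer(question, trace)


# Stub tools (hypothetical): real Search, Code, and Perceive tools would
# query the web, run code, and inspect image regions respectively.
STUB_TOOLS: Dict[str, Tool] = {
    "search": lambda q: f"search results for '{q}'",
    "code": lambda expr: str(sum(int(term) for term in expr.split("+"))),
    "perceive": lambda q: f"image region relevant to '{q}'",
}

if __name__ == "__main__":
    print(run_pipeline("How many apples are on the table?", STUB_TOOLS))
```

Keeping the tools in a plain dictionary keyed by name is one simple way to let the executor dispatch on whatever the plan requests; the actual framework may organize tool calls differently.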
Anthology ID:
2026.findings-acl.11
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
211–225
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.11/
Cite (ACL):
Mengjie Deng, Guanting Dong, and Zhicheng Dou. 2026. ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use. In Findings of the Association for Computational Linguistics: ACL 2026, pages 211–225, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use (Deng et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.11.pdf
Checklist:
 2026.findings-acl.11.checklist.pdf