VisTRA: Visual Tool-use Reasoning Analyzer for Small Object Visual Question Answering

Hiroaki Sugiyama, Ko Koga, Toshifumi Nishijima


Abstract
This study proposes VisTRA (Visual Tool-use Reasoning Analyzer), a framework for analyzing how Visual Language Models (VLMs) utilize tools in VQA tasks involving small objects in high-resolution images. While tools such as object detection and zoom functionality are essential for small object VQA, their outputs can contain errors and therefore require careful verification. Our framework provides a systematic evaluation of VLMs’ tool-use capabilities through analysis of their verification patterns. Using the V* bench dataset, we find that direct acceptance of tool outputs correlates with decreased VQA accuracy, while lower-performing models exhibit higher frequencies of cyclic verification loops. These findings offer insights for improving tool verification mechanisms in VLM architectures focused on small object detection tasks.
Anthology ID:
2025.realm-1.26
Volume:
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Ehsan Kamalloo, Nicolas Gontier, Xing Han Lu, Nouha Dziri, Shikhar Murty, Alexandre Lacoste
Venues:
REALM | WS
Publisher:
Association for Computational Linguistics
Pages:
356–366
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.realm-1.26/
Cite (ACL):
Hiroaki Sugiyama, Ko Koga, and Toshifumi Nishijima. 2025. VisTRA: Visual Tool-use Reasoning Analyzer for Small Object Visual Question Answering. In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 356–366, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
VisTRA: Visual Tool-use Reasoning Analyzer for Small Object Visual Question Answering (Sugiyama et al., REALM 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.realm-1.26.pdf