Ko Koga
2025
VisTRA: Visual Tool-use Reasoning Analyzer for Small Object Visual Question Answering
Hiroaki Sugiyama | Ko Koga | Toshifumi Nishijima
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
This study proposes VisTRA (Visual Tool-use Reasoning Analyzer), a framework for analyzing how Visual Language Models (VLMs) utilize tools in VQA tasks involving small objects in high-resolution images. While tools like object detection and zoom functionality are essential for small object VQA, their potential errors necessitate careful verification of outputs. Our framework provides systematic evaluation of VLMs’ tool-use capabilities through analysis of verification patterns. Using the V* bench dataset, we find that direct acceptance of tool outputs correlates with decreased VQA accuracy, while lower-performing models exhibit higher frequencies of cyclic verification loops. These findings offer insights for improving tool verification mechanisms in VLM architectures focused on small object detection tasks.