On the Fine-Grained Planning Abilities of VLM Web Agents

Surgan Jandial, Yinong Oliver Wang, Andrea Bajcsy, Fernando De la Torre


Abstract
Vision-Language Models (VLMs) have shown promise as web agents, yet their planning (the ability to devise strategies or action sequences to complete tasks) remains understudied. While prior works focus on VLMs' perception and overall success rates (i.e., goal completion), fine-grained investigation of their planning has been overlooked. To address this gap, we examine VLMs' capability to (1) understand temporal relationships within web contexts, and (2) assess plans of actions across diverse scenarios. We design four simple yet effective tests to probe these nuanced aspects of planning. Our results across nineteen VLMs reveal that these models exhibit limited performance on the aforementioned skills and are not yet reliable enough to function as web agents. To facilitate future work, we release our planning evaluations and data, providing a foundation for future research in this area.
Anthology ID:
2025.findings-emnlp.1382
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
25347–25380
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1382/
DOI:
10.18653/v1/2025.findings-emnlp.1382
Cite (ACL):
Surgan Jandial, Yinong Oliver Wang, Andrea Bajcsy, and Fernando De la Torre. 2025. On the Fine-Grained Planning Abilities of VLM Web Agents. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25347–25380, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
On the Fine-Grained Planning Abilities of VLM Web Agents (Jandial et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1382.pdf
Checklist:
2025.findings-emnlp.1382.checklist.pdf