Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Yiqin Wang | Haoji Zhang | Jingqi Tian | Yansong Tang
Findings of the Association for Computational Linguistics: ACL 2025
Most existing GUI agents depend on non-visual inputs such as HTML source code or accessibility trees, which limits their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), though they excel at using vision to ground real-world objects, often struggle to accurately localize GUI elements – a critical requirement for effective GUI automation – due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control that uses only visual input. Our approach combines a general-purpose MLLM as an ‘interpreter’, responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a ‘locator’ that precisely locates GUI elements for action placement. By relying on purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Further offline and interactive agent benchmarks across various GUI environments – including web pages, desktop software, and mobile UIs – demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents.
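To make the divide-and-conquer design concrete, the sketch below illustrates the interpreter-then-locator control loop the abstract describes. It is a minimal illustration, not the paper's implementation: all names (InterpreterMLLM, LocatorMLLM, take_screenshot, click) are hypothetical stand-ins for the assumed interfaces.

```python
# Minimal sketch of the "ponder" (interpret) then "press" (locate + act) loop.
# All class and function names here are hypothetical, not APIs from the paper.

from dataclasses import dataclass


@dataclass
class ActionStep:
    description: str  # e.g. "click the blue 'Submit' button"
    done: bool        # True once the user instruction is fulfilled


class InterpreterMLLM:
    """General-purpose MLLM: translates a high-level instruction plus the
    current screenshot into a detailed description of the next action."""

    def next_action(self, instruction: str, screenshot: bytes) -> ActionStep:
        raise NotImplementedError  # would be backed by a general-purpose MLLM


class LocatorMLLM:
    """GUI-specific MLLM: grounds an action description to (x, y) pixel
    coordinates of the target GUI element on the screenshot."""

    def locate(self, description: str, screenshot: bytes) -> tuple[int, int]:
        raise NotImplementedError  # would be backed by a GUI-grounding MLLM


def run_agent(instruction: str, take_screenshot, click, max_steps: int = 20) -> None:
    """Run the two-stage loop using purely visual input."""
    interpreter, locator = InterpreterMLLM(), LocatorMLLM()
    for _ in range(max_steps):
        shot = take_screenshot()                           # vision-only input
        step = interpreter.next_action(instruction, shot)  # "ponder"
        if step.done:
            break
        x, y = locator.locate(step.description, shot)      # ground the element
        click(x, y)                                        # "press"
```

The key design choice this illustrates is the separation of concerns: the interpreter never needs pixel-accurate grounding, and the locator never needs long-horizon planning, so each model can be chosen or tuned for its own subtask.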