WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models

Zheng Hui; Yinheng Li; Dan Zhao; Colby Banbury; Tianyi Chen; Kazuhito Koishida

Abstract

Graphical User Interface (GUI) automation relies on accurate GUI grounding. However, obtaining large-scale, high-quality labeled data remains a key challenge, particularly in desktop environments such as the Windows operating system (OS). Existing datasets primarily focus on structured web-based elements, leaving a gap in real-world GUI interaction data for non-web applications. To address this gap, we introduce a new framework that leverages LLMs to generate large-scale GUI grounding data, enabling automated and scalable labeling across diverse interfaces. To ensure high accuracy and reliability, we manually validated and refined 5,000 GUI coordinate-instruction pairs, creating WinSpot, the first benchmark specifically designed for GUI grounding tasks in Windows environments. WinSpot provides a high-quality dataset for training and evaluating visual GUI agents, establishing a foundation for future research in GUI automation across diverse and unstructured desktop environments.

Anthology ID: 2025.acl-short.85
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1086–1096
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-short.85/
Cite (ACL): Zheng Hui, Yinheng Li, Dan Zhao, Colby Banbury, Tianyi Chen, and Kazuhito Koishida. 2025. WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1086–1096, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models (Hui et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-short.85.pdf
