R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar
Abstract
Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
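The abstract does not spell out the exact form of the IoU-aware objective. The sketch below is only an illustrative guess at how such a term might be combined with the usual token-level cross-entropy grounding loss, using a simple 1 − IoU penalty on the predicted box; the function names and the `lambda_iou` weighting factor are assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): adding an IoU-aware
# penalty to the standard cross-entropy grounding loss so that the model is
# rewarded for predictions with high overlap, not just correct tokens.

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_loss(ce_loss, pred_box, gt_box, lambda_iou=1.0):
    """Cross-entropy on the coordinate tokens plus an IoU-aware penalty.

    ce_loss: token-level cross-entropy already computed by the VLM.
    lambda_iou: hypothetical weighting factor for the IoU term.
    """
    iou = box_iou(pred_box, gt_box)
    return ce_loss + lambda_iou * (1.0 - iou)
```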
- Anthology ID: 2025.findings-acl.501
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venues: Findings | WS
- Publisher: Association for Computational Linguistics
- Pages: 9669–9685
- URL: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.501/
- Cite (ACL): Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, and Shabnam Ghadar. 2025. R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9669–9685, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding (Park et al., Findings 2025)
- PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.501.pdf