Visual Grounding for User Interfaces

Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva


Abstract
Enabling autonomous language agents to drive application user interfaces (UIs) as humans do can significantly expand the capabilities of today’s API-based agents. Essential to this vision is the ability of agents to ground natural language commands to on-screen UI elements. Prior UI grounding approaches rely on developer-provided UI metadata (UI trees, such as the web DOM, and accessibility labels) to detect on-screen elements. However, such metadata is often unavailable or incomplete. Object detection techniques applied to UI screens remove this dependency by inferring the locations and types of UI elements directly from the UI’s visual appearance. The extracted semantics, however, are too limited to directly enable grounding. We overcome the limitations of both approaches by introducing the task of visual UI grounding, which unifies detection and grounding. A model takes as input a UI screenshot and a free-form language expression, and must identify the referenced UI element. We propose a solution to this problem, LVG, which learns UI element detection and grounding using a new technique called layout-guided contrastive learning, in which the semantics of individual UI objects are learned also from their visual organization. To counter the scarcity of UI datasets, LVG integrates synthetic data into its training using multi-context learning. LVG outperforms baselines pre-trained on much larger datasets by over 4.9 points in top-1 accuracy, demonstrating its effectiveness.
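For intuition, the following is a minimal sketch of the kind of contrastive grounding objective the abstract alludes to: embeddings of detected UI elements are scored against the embedding of the referring expression, and the referenced element is trained to score highest. All names, shapes, and the exact loss form are illustrative assumptions, not the paper's LVG implementation, and the layout-guided component is omitted here.

import torch
import torch.nn.functional as F

def contrastive_grounding_loss(element_emb, text_emb, target_idx, temperature=0.07):
    # Hypothetical InfoNCE-style loss over on-screen elements, not LVG's actual objective.
    # element_emb: (num_elements, dim) embeddings of detected UI elements
    # text_emb:    (dim,) embedding of the free-form referring expression
    # target_idx:  index of the ground-truth referenced element
    element_emb = F.normalize(element_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = element_emb @ text_emb / temperature  # (num_elements,) similarity scores
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_idx]))

# Toy usage: 5 candidate elements detected on a screenshot, 128-d embeddings,
# where element 2 is the one referenced by the expression.
elements = torch.randn(5, 128)
expression = torch.randn(128)
loss = contrastive_grounding_loss(elements, expression, target_idx=2)

In this framing, grounding at inference time reduces to picking the element with the highest similarity score, which is why training the referenced element to win the softmax over all detected elements is a natural fit for the unified detection-plus-grounding task.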
Anthology ID:
2024.naacl-industry.9
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yi Yang, Aida Davani, Avi Sil, Anoop Kumar
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
97–107
URL:
https://aclanthology.org/2024.naacl-industry.9
Cite (ACL):
Yijun Qian, Yujie Lu, Alexander Hauptmann, and Oriana Riva. 2024. Visual Grounding for User Interfaces. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 97–107, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Visual Grounding for User Interfaces (Qian et al., NAACL 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.naacl-industry.9.pdf