Look Where You’re Told: Instruction-Consistent Attention for GUI Grounding

Seonhoon Kim, Zhiyu Chen, Xin Li, Qun Liu


Abstract
Visual grounding in graphical user interfaces (GUIs) requires accurate localization of UI elements from natural language instructions. Conventional coordinate-generation approaches face inherent limitations, including sensitivity to resolution variations and a lack of interpretability. Recently, coordinate-free attention-based methods have emerged as a promising alternative, but these methods supervise attention using only spatial location signals from ground-truth bounding boxes, without ensuring that the learned attention distributions reflect genuine semantic correspondence between the instruction and the attended visual regions. We propose Attention Cycle-Consistency (ACC), a self-supervised regularization framework that enforces bidirectional alignment between visual attention and instruction semantics. ACC introduces two complementary constraints: semantic consistency, which ensures that attended visual regions contain sufficient information to reconstruct the original instruction, and spatial consistency, which requires attention distributions to remain invariant when cycled through instruction reconstruction. We further incorporate entropy regularization to encourage spatially concentrated attention. ACC is applicable as a lightweight, model-agnostic regularizer for attention-based coordinate-free grounding methods, and it adds no computational overhead at inference because all auxiliary components are discarded after training.
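The abstract names three training-time terms (semantic consistency, spatial consistency, and entropy regularization) without giving their formulas. The following is a minimal sketch of how such terms could be combined, not the authors' formulation. It assumes that semantic consistency is supplied as an instruction-reconstruction negative log-likelihood, that spatial consistency is measured as a KL divergence between the original attention and the attention obtained after cycling through the reconstructed instruction, and that the weights `w_sem`, `w_spa`, and `w_ent` are hypothetical hyperparameters. All function and parameter names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw per-patch attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p, eps=1e-12):
    """Shannon entropy; lower values mean more spatially concentrated attention."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); zero when the two attention distributions coincide."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def acc_regularizer(attn_scores, cycled_scores, recon_nll,
                    w_sem=1.0, w_spa=1.0, w_ent=0.1):
    """Hypothetical combined regularizer (assumed form, not from the paper).

    attn_scores:   attention scores over visual patches for the instruction
    cycled_scores: attention scores recomputed from the reconstructed instruction
    recon_nll:     negative log-likelihood of reconstructing the instruction
                   from the attended regions (computed elsewhere)
    """
    attn = softmax(attn_scores)
    cycled = softmax(cycled_scores)
    spatial = kl_divergence(attn, cycled)   # assumed spatial-consistency measure
    concentration = entropy(attn)           # entropy term to be minimized
    return w_sem * recon_nll + w_spa * spatial + w_ent * concentration
```

Because a term of this kind would only be added to the training loss, nothing in the sketch needs to run at inference, which is consistent with the abstract's statement that the auxiliary components are discarded after training.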
Anthology ID:
2026.alvr-main.12
Volume:
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues:
ALVR | WS
Publisher:
Association for Computational Linguistics
Pages:
155–163
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.12/
Cite (ACL):
Seonhoon Kim, Zhiyu Chen, Xin Li, and Qun Liu. 2026. Look Where You’re Told: Instruction-Consistent Attention for GUI Grounding. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 155–163, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Look Where You’re Told: Instruction-Consistent Attention for GUI Grounding (Kim et al., ALVR 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.12.pdf