UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents
Jiwen Zhang, Ya-Qi Yu, Minghui Liao, WenTao Li, Jihao Wu, Zhongyu Wei
Abstract
Graphical User Interface (GUI) agents are expected to precisely operate on the screens of digital devices. Existing GUI agents merely depend on current visual observations and plain-text action history, ignoring the significance of history screens. To mitigate this issue, we propose **UI-Hawk**, a multi-modal GUI agent specially designed to process screen streams encountered during GUI navigation. UI-Hawk incorporates a history-aware visual encoder to handle the screen sequences. To acquire a better understanding of screen streams, we select four fundamental tasks: UI grounding, UI referring, screen question answering, and screen summarization. We further propose a curriculum learning strategy to subsequently guide the model from fundamental tasks to advanced screen-stream comprehension. Along with the efforts above, we have also created a benchmark FunUI to quantitatively evaluate the fundamental screen understanding ability of MLLMs. Extensive experiments on FunUI and GUI navigation benchmarks consistently validate that screen stream understanding is essential for GUI tasks. Our code and data are now available at https://github.com/IMNearth/UIHawk.
- Anthology ID: 2025.emnlp-main.920
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 18228–18247
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.920/
- Cite (ACL): Jiwen Zhang, Ya-Qi Yu, Minghui Liao, WenTao Li, Jihao Wu, and Zhongyu Wei. 2025. UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18228–18247, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents (Zhang et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.920.pdf