Jihao Wu
2025
UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents
Jiwen Zhang | Ya-Qi Yu | Minghui Liao | WenTao Li | Jihao Wu | Zhongyu Wei
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Graphical User Interface (GUI) agents are expected to operate precisely on the screens of digital devices. Existing GUI agents depend merely on the current visual observation and a plain-text action history, ignoring the significance of historical screens. To mitigate this issue, we propose **UI-Hawk**, a multi-modal GUI agent specially designed to process the screen streams encountered during GUI navigation. UI-Hawk incorporates a history-aware visual encoder to handle screen sequences. To acquire a better understanding of screen streams, we select four fundamental tasks: UI grounding, UI referring, screen question answering, and screen summarization. We further propose a curriculum learning strategy that progressively guides the model from these fundamental tasks to advanced screen-stream comprehension. Along with the efforts above, we have also created a benchmark, FunUI, to quantitatively evaluate the fundamental screen understanding ability of MLLMs. Extensive experiments on FunUI and GUI navigation benchmarks consistently validate that screen stream understanding is essential for GUI tasks. Our code and data are available at https://github.com/IMNearth/UIHawk.
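To make the abstract's two key ideas more concrete, below is a minimal Python sketch of (a) a history-aware encoder over a stream of screen features and (b) a curriculum schedule that orders the four fundamental tasks before full navigation. This is an illustration under assumptions, not the UI-Hawk implementation; all module names, dimensions, and task identifiers are hypothetical, and the actual code lives in the linked repository.

```python
# Minimal sketch (not the official UI-Hawk code): encodes a stream of history
# screens so the agent can attend to past observations, and defines a
# curriculum that moves from fundamental screen tasks to navigation.
# Module names, dimensions, and task names are hypothetical placeholders.
import torch
import torch.nn as nn


class HistoryAwareScreenEncoder(nn.Module):
    """Fuses a sequence of per-screen features with self-attention."""

    def __init__(self, feat_dim: int = 768, num_heads: int = 8, max_screens: int = 16):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(1, max_screens, feat_dim))
        self.attn = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )

    def forward(self, screen_feats: torch.Tensor) -> torch.Tensor:
        # screen_feats: (batch, num_screens, feat_dim), oldest screen first
        n = screen_feats.size(1)
        x = screen_feats + self.pos_emb[:, :n]
        return self.attn(x)  # history-aware screen representations


# Curriculum: fundamental screen-understanding tasks first, navigation last.
CURRICULUM = [
    ["ui_grounding", "ui_referring"],        # stage 1: element-level tasks
    ["screen_qa", "screen_summarization"],   # stage 2: screen-level tasks
    ["gui_navigation"],                      # stage 3: screen-stream navigation
]

if __name__ == "__main__":
    encoder = HistoryAwareScreenEncoder()
    dummy_stream = torch.randn(2, 5, 768)  # batch of 2 episodes, 5 screens each
    fused = encoder(dummy_stream)
    print(fused.shape)  # torch.Size([2, 5, 768])
    for stage, tasks in enumerate(CURRICULUM, 1):
        print(f"stage {stage}: train on {tasks}")
```

The staged task list mirrors the curriculum idea described in the abstract: element-level and screen-level understanding are learned before the model is trained on full screen-stream navigation.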
2024
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Jiwen Zhang | Jihao Wu | Teng Yihua | Minghui Liao | Nuo Xu | Xiao Xiao | Zhongyu Wei | Duyu Tang
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete tasks triggered by natural language by predicting a sequence of API actions. Even though such tasks rely heavily on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of previous actions, the current screen, and, more importantly, the reasoning about which actions should be performed and the outcomes of the chosen action. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves on-par performance with CogAgent-Chat-18B.
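As an illustration of the Chain-of-Action-Thought idea, the sketch below assembles a prompt from the components the abstract lists: previous actions, the current screen, the action thinking, and the expected outcome of the chosen action. It is not the paper's implementation; the dataclass fields, action strings, and prompt template are hypothetical assumptions.

```python
# Minimal sketch (not the official CoAT code): serializes one
# Chain-of-Action-Thought step into a prompt for a multimodal LLM.
# Field names, action strings, and the template are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoATStep:
    screen_description: str                     # what the current screen shows
    previous_actions: List[str] = field(default_factory=list)
    action_thinking: str = ""                   # reasoning about the next action
    next_action: str = ""                       # the action the agent decides on
    expected_outcome: str = ""                  # what the action should cause


def build_coat_prompt(task: str, step: CoATStep) -> str:
    """Build a text prompt containing all CoAT components for one step."""
    history = "\n".join(
        f"  {i + 1}. {a}" for i, a in enumerate(step.previous_actions)
    ) or "  (none)"
    return (
        f"Task: {task}\n"
        f"Previous actions:\n{history}\n"
        f"Current screen: {step.screen_description}\n"
        f"Action thinking: {step.action_thinking}\n"
        f"Next action: {step.next_action}\n"
        f"Expected outcome: {step.expected_outcome}\n"
    )


if __name__ == "__main__":
    step = CoATStep(
        screen_description="Settings home page with a list of categories.",
        previous_actions=["OPEN_APP(Settings)"],
        action_thinking="To enable dark mode, I should open Display settings.",
        next_action="CLICK(Display)",
        expected_outcome="The Display settings page appears.",
    )
    print(build_coat_prompt("Turn on dark mode", step))
```

Keeping the action thinking and expected outcome explicit in the prompt is what distinguishes this format from plain action-history context modeling.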