Yuxiang Chai
2026
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark
Guangyi Liu | Pengxiang Zhao | Liang Liu | Zhiming Chen | Yuxiang Chai | Yaozhen Liang | WenHao Wang | Siheng Chen | Zhengxi Lu | Shuai Ren | Hao Wang | Shibo He | Yong Liu | Wenchao Meng
Findings of the Association for Computational Linguistics: ACL 2026
Mobile GUI agents show promise in automating tasks but face significant generalization challenges in long-tail scenarios. While learning from few-shot demonstrations is an emerging solution, its progress is hindered by two critical gaps: the lack of a comprehensive benchmark for systematic evaluation on mobile devices, and the absence of a systematic framework designed to learn from demonstrations in this domain. To address these gaps, we introduce LearnGUI, the first comprehensive benchmark designed for studying demonstration-based learning in mobile agents, comprising 2,252 offline and 101 online tasks. We further develop LearnAct, a modular agent framework engineered to systematically extract, retrieve, and leverage knowledge from visual demonstrations. Extensive evaluations across six backbone models validate our approach: LearnAct achieves dramatic improvements for general-purpose models (e.g., Gemini-2.5-Pro: 38.5%→58.9%) and specialized models alike (e.g., UI-TARS-7B-SFT’s online success rate: 18.1%→32.8%), demonstrating consistent gains across model architectures. Our work provides a robust benchmark and a systematic framework, paving the way for more adaptable and practical mobile agents. Our code and data are publicly available at https://lgy0404.github.io/LearnAct/.
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
Pengxiang Zhao | Guangyi Liu | Yaozhen Liang | Weiqing He | Zhengxi Lu | WenHao Wang | Yuehao Huang | Yuxiang Chai | Zhaolu Kang | Yaxuan Guo | Hao Wang | Kexin Zhang | Liang Liu | Yong Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Shortcuts such as APIs and deep-links have emerged as efficient complements to flexible GUI operations, fostering a promising hybrid paradigm for MLLM-based mobile automation. However, systematic evaluation of GUI-shortcut hybrid agents remains largely underexplored. To bridge this gap, we introduce **MAS-Bench**, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent’s capability to *autonomously generate* shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 9 evaluation metrics. Experiments demonstrate that hybrid agents achieve up to 68.3% success rate and 39% greater execution efficiency than GUI-only counterparts. Furthermore, our evaluation framework effectively reveals the quality gap between predefined and agent-generated shortcuts, validating its capability to assess shortcut generation methods. MAS-Bench addresses the lack of systematic benchmarks for GUI-shortcut hybrid mobile agents, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.
A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
Yuxiang Chai | Shunye Tang | Han Xiao | Weifeng Lin | Hanhao Li | Jiayu Zhang | Liang Liu | Pengxiang Zhao | Guangyi Liu | Guozhi Wang | Shuai Ren | Rongduo Han | Haining Zhang | Siyuan Huang | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2026
The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of mobile graphical user interface (GUI) AI agents, which are designed to autonomously perform tasks on mobile devices. However, a significant gap persists in mobile GUI agent evaluation: existing benchmarks predominantly rely on either static frame assessments such as AndroidControl or offline static apps such as AndroidWorld, and thus fail to capture agent performance in dynamic, real-world online mobile apps. To address this gap, we present Android Agent Arena (A3), a novel "essential-state" based procedural evaluation system for mobile GUI agents. A3 introduces a benchmark of 100 tasks derived from 20 widely used, dynamic online apps across 20 categories from the Google Play Store, ensuring comprehensive evaluation. A3 also presents a novel "essential-state" based procedural evaluation method that leverages MLLMs as reward models to progressively verify task completion and process achievement. This evaluation approach addresses the limitations of traditional function-based evaluation methods on online dynamic apps. Furthermore, A3 includes a toolkit to streamline Android device interaction, reset online environments and apps, and facilitate data collection from both human and agent demonstrations. The complete A3 system, including the benchmark and tools, will be publicly released to provide a robust foundation for future research and development in mobile GUI agents.
2025
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Yuxiang Chai | Siyuan Huang | Yazhe Niu | Han Xiao | Liang Liu | Guozhi Wang | Dingyu Zhang | Shuai Ren | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2025
AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents, which are capable of completing tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, which are annotated at multiple levels. Unlike existing GUI-related datasets such as Rico and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we finetune a baseline model, SPHINX Agent, and illustrate the effectiveness of AMEX.
Co-authors
- Liang Liu (陆亮) 4
- Guangyi Liu 3
- Shuai Ren 3
- Pengxiang Zhao 3
- Siyuan Huang 2
- Hongsheng Li 2
- Yaozhen Liang 2
- Yong Liu 2
- Zhengxi Lu 2
- Wenhao Wang 2
- Hao Wang 2
- Guozhi Wang 2
- Han Xiao 2
- Zhiming Chen 1
- Siheng Chen 1
- Yaxuan Guo 1
- Rongduo Han 1
- Shibo He 1
- Weiqing He 1
- Yuehao Huang 1
- Zhaolu Kang 1
- Hanhao Li 1
- Weifeng Lin 1
- Wenchao Meng 1
- Yazhe Niu 1
- Shunye Tang 1
- Kexin Zhang 1
- Jiayu Zhang 1
- Haining Zhang 1
- Dingyu Zhang 1