Guozhi Wang
2026
A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
Yuxiang Chai | Shunye Tang | Han Xiao | Weifeng Lin | Hanhao Li | Jiayu Zhang | Liang Liu | Pengxiang Zhao | Guangyi Liu | Guozhi Wang | Shuai Ren | Rongduo Han | Haining Zhang | Siyuan Huang | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2026
The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of mobile graphical user interface (GUI) AI agents, which are designed to autonomously perform tasks on mobile devices. However, a significant gap persists in mobile GUI agent evaluation: existing benchmarks predominantly rely on either static frame assessments, such as AndroidControl, or offline static apps, such as AndroidWorld, and thus fail to capture agent performance in dynamic, real-world online mobile apps. To address this gap, we present Android Agent Arena (A3), a novel "essential-state" based procedural evaluation system for mobile GUI agents. A3 introduces a benchmark of 100 tasks derived from 20 widely used, dynamic online apps across 20 categories from the Google Play Store, ensuring comprehensive evaluation. A3 also presents a novel "essential-state" based procedural evaluation method that leverages MLLMs as reward models to progressively verify task completion and process achievement. This evaluation approach addresses the limitations of traditional function-based evaluation methods on online dynamic apps. Furthermore, A3 includes a toolkit to streamline Android device interaction, reset online environments and apps, and facilitate data collection from both human and agent demonstrations. The complete A3 system, including the benchmark and tools, will be publicly released to provide a robust foundation for future research and development in mobile GUI agents.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
Jichao Wang | Liuyang Bian | Yufeng Zhou | Han Xiao | Yue Pan | Guozhi Wang | Hao Wang | Zhaoxiong Wang | Yafei Wen | Xiaoxin Chen | Shuai Ren | Lingfang Zeng
Findings of the Association for Computational Linguistics: ACL 2026
As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon RL). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
2025
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Yuxiang Chai | Siyuan Huang | Yazhe Niu | Han Xiao | Liang Liu | Guozhi Wang | Dingyu Zhang | Shuai Ren | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2025
AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents, which are capable of completing tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, which are annotated at multiple levels. Unlike existing GUI-related datasets such as Rico and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we finetune a baseline model, SPHINX Agent, and illustrate the effectiveness of AMEX.