Hanhao Li
2026
MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment
Qinzhuo Wu | Zhizhuo Yang | Hanhao Li | Pengzhi Gao | Wei Liu | Jian Luan
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents’ task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures the task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 13 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments.
A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
Yuxiang Chai | Shunye Tang | Han Xiao | Weifeng Lin | Hanhao Li | Jiayu Zhang | Liang Liu | Pengxiang Zhao | Guangyi Liu | Guozhi Wang | Shuai Ren | Rongduo Han | Haining Zhang | Siyuan Huang | Hongsheng Li
Findings of the Association for Computational Linguistics: ACL 2026
The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of mobile graphical user interface (GUI) AI agents, which are designed to autonomously perform tasks on mobile devices. However, a significant gap persists in mobile GUI agent evaluation: existing benchmarks predominantly rely on either static frame assessments such as AndroidControl or offline static apps such as AndroidWorld, and thus fail to capture agent performance in dynamic, real-world online mobile apps. To address this gap, we present Android Agent Arena (A3), a novel "essential-state" based procedural evaluation system for mobile GUI agents. A3 introduces a benchmark of 100 tasks derived from 20 widely-used, dynamic online apps across 20 categories from the Google Play Store, ensuring comprehensive evaluation. A3 also presents a novel "essential-state" based procedural evaluation method that leverages MLLMs as reward models to progressively verify task completion and process achievement. This evaluation approach addresses the limitations of traditional function-based evaluation methods on online dynamic apps. Furthermore, A3 includes a toolkit to streamline Android device interaction, reset online environments and apps, and facilitate data collection from both human and agent demonstrations. The complete A3 system, including the benchmark and tools, will be publicly released to provide a robust foundation for future research and development in mobile GUI agents.
2025
OAgents: An Empirical Study of Building Effective Agents
He Zhu | Tianrui Qin | King Zhu | Heyuan Huang | Yeyi Guan | Jinxiang Xia | Hanhao Li | Yi Yao | Ningning Wang | Pai Liu | Tianhao Peng | Xin Gui | Li Xiaowan | Yuhui Liu | Xiangru Tang | Jian Yang | Ge Zhang | Xitong Gao | Yuchen Eleanor Jiang | Changwang Zhang | Jun Wang | Jiaheng Liu | Wangchunshu Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Recently, Agentic AI has become an increasingly popular field of research. However, we argue that current practices in agent research are far from standard, rigorous scientific research, which makes it hard to conduct apples-to-apples comparisons among and against existing methods. As a result, it remains unclear how different design choices in an agent framework impact its effectiveness, and measuring progress in agent research remains very hard. In this work, we conduct a systematic empirical study on the GAIA benchmark to investigate the impact of different popular design choices within key agent components in a fair and rigorous way. To begin with, we find that the lack of a standard evaluation protocol makes previous works, even the open-sourced ones, not reproducible, and the variance between different random runs is often non-negligible. Therefore, we first introduce a more robust evaluation protocol to make comparisons more stable. Our empirical study then unveils which components and designs, as well as correlations between these designs, are the keys for building effective agents, while others are redundant despite seeming sensible. With the insights gained from our empirical study, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects, providing a good starting point and guidelines for building effective agents. More importantly, OAgents supports various design choices for agent components in a modularized way, facilitating future scientific research on Agentic AI.
Co-authors
- Yuxiang Chai 1
- Pengzhi Gao 1
- Xitong Gao 1
- Yeyi Guan 1
- Xin Gui 1
- Rongduo Han 1
- Heyuan Huang 1
- Siyuan Huang 1
- Yuchen Eleanor Jiang 1
- Hongsheng Li 1
- Weifeng Lin 1
- Guangyi Liu 1
- Jiaheng Liu 1
- Liang Liu (陆亮) 1
- Pai Liu 1
- Wei Liu 1
- Yuhui Liu 1
- Jian Luan 1
- Tianhao Peng 1
- Tianrui Qin 1
- Shuai Ren 1
- Shunye Tang 1
- Xiangru Tang 1
- Guozhi Wang 1
- Jun Wang 1
- Ningning Wang 1
- Qinzhuo Wu 1
- Jinxiang Xia 1
- Han Xiao 1
- Li Xiaowan 1
- Jian Yang 1
- Zhizhuo Yang 1
- Yi Yao 1
- Changwang Zhang 1
- Ge Zhang 1
- Haining Zhang 1
- Jiayu Zhang 1
- Pengxiang Zhao 1
- Wangchunshu Zhou 1
- He Zhu 1
- King Zhu 1