Jianan Zhang


2026

While AndroidWorld has become the dominant mobile-use benchmark thanks to its reproducible environment and deterministic evaluation, recent agents achieving success rates above 90% indicate saturation and motivate a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark with 201 tasks across 20 applications that reflects real-world usage through long-horizon, cross-application workflows requiring nearly twice as many steps (27.8 vs. 14.3) and featuring significantly more multi-app tasks (62.2% vs. 9.5%) than AndroidWorld. MobileWorld balances production-grade utility and reproducible evaluation by using open-source alternatives to industry standards (e.g., Mattermost for Slack), enabling full observability through source code modification and direct database access. Beyond standard GUI manipulation, MobileWorld introduces novel task categories, including agent-user interaction and Model Context Protocol (MCP)-augmented tasks, for evaluating agents in user-aware, hybrid-tool scenarios. We develop a planner-executor framework with extended action spaces supporting user interactions and MCP calls. Results show a sharp performance drop relative to AndroidWorld, with the best agentic framework and the best end-to-end model achieving 51.7% and 20.9% success rates, respectively, highlighting substantial room for future research.
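To make the idea of an extended action space concrete, the following is a minimal sketch, not the framework described in the abstract. It assumes three kinds of action (a GUI step, a question to the user, and an MCP tool call) and a router that decides which channel handles each one. All class names, field names, and channel labels here are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    """A conventional GUI step (hypothetical shape)."""
    kind: str    # illustrative labels such as "tap" or "type"
    target: str  # free-text description of the UI element


@dataclass
class AskUser:
    """A question sent to the user, e.g. to resolve a vague instruction."""
    question: str


@dataclass
class McpCall:
    """A tool invocation routed to an MCP server (tool name is hypothetical)."""
    tool: str
    arguments: dict = field(default_factory=dict)


# The extended action space is the union of the three action kinds.
Action = Union[GuiAction, AskUser, McpCall]


def route(action: Action) -> str:
    """Return the channel that should execute the given action."""
    if isinstance(action, GuiAction):
        return "device"
    if isinstance(action, AskUser):
        return "user"
    if isinstance(action, McpCall):
        return "mcp"
    raise TypeError(f"unsupported action: {action!r}")


# Example plan mixing all three action kinds.
plan = [
    AskUser("Which channel should the report be posted to?"),
    McpCall("search_files", {"query": "quarterly report"}),
    GuiAction("tap", "Send button"),
]
channels = [route(step) for step in plan]
```

In this sketch a planner would emit `Action` values and an executor would call `route` to pick the handler; how the actual framework divides this work is not specified in the abstract.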

2024

Temporal knowledge graph (TKG) reasoning, which predicts future facts, has received much attention. Most previous works model temporal dynamics with knowledge graphs and graph convolutional networks. However, these methods do not consider high-order interactions between objects in the TKG, which are an important factor in predicting future facts. To address this problem, we introduce dynamic hypergraph embedding for temporal knowledge graph reasoning. Specifically, we capture high-order interactions by constructing hypergraphs from the temporal knowledge graph at different timestamps. In addition, we integrate the differences caused by time into the hypergraph representation so that it fits the TKG. We then adapt dynamic meta-embedding to the temporal hypergraph representation, which allows our model to choose the appropriate high-order interactions for downstream reasoning. Experimental results on public TKG datasets show that our method outperforms the baselines. Furthermore, our analysis demonstrates that the proposed method provides good interpretability for the predicted results.
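The step of constructing hypergraphs at different timestamps can be illustrated with a small sketch. The abstract does not state how hyperedges are defined, so this example assumes one simple rule: within each timestamp, all entities that take part in facts with the same relation form one hyperedge. The function names and the toy facts are hypothetical.

```python
from collections import defaultdict


def build_snapshot_hypergraphs(quadruples):
    """Group (subject, relation, object, timestamp) facts by timestamp.

    Assumed rule (not taken from the paper): one hyperedge per relation per
    timestamp, containing every entity that appears in a fact with that
    relation. Returns {timestamp: {relation: frozenset_of_entities}}.
    """
    snapshots = defaultdict(lambda: defaultdict(set))
    for subj, rel, obj, ts in quadruples:
        snapshots[ts][rel].update((subj, obj))
    return {
        ts: {rel: frozenset(members) for rel, members in edges.items()}
        for ts, edges in snapshots.items()
    }


def incidence_matrix(hyperedges):
    """Build a 0/1 entity-by-hyperedge incidence matrix for one snapshot."""
    entities = sorted({e for members in hyperedges.values() for e in members})
    relations = sorted(hyperedges)
    matrix = [
        [1 if ent in hyperedges[rel] else 0 for rel in relations]
        for ent in entities
    ]
    return entities, relations, matrix


# Toy facts: three at timestamp 0 and one at timestamp 1.
facts = [
    ("A", "visits", "B", 0),
    ("C", "visits", "B", 0),
    ("A", "sanctions", "D", 0),
    ("A", "visits", "C", 1),
]
hypergraphs = build_snapshot_hypergraphs(facts)
entities, relations, matrix = incidence_matrix(hypergraphs[0])
```

A hyperedge such as the `visits` edge at timestamp 0 connects three entities at once, which is the kind of high-order interaction that a pairwise graph cannot express directly. The incidence matrix is the usual input to hypergraph convolution layers; the embedding and meta-embedding stages are outside the scope of this sketch.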