Jiamu Zhou
2026
ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
Jihong Wang | Jiamu Zhou | Weiming Zhang | Teng Wang | Weiwen Liu | Zhuosheng Zhang | Xingyu Lou | Weinan Zhang | Huarong Deng | Jun Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, characterized by the accumulation of decision drift over extended interactions. To address these challenges, we introduce ColorBrowserAgent (Complex Long-Horizon Browser Agent), a knowledge-evolving agent for robust web automation. Our approach relies on two synergistic mechanisms: human-in-the-loop knowledge adaptation, which transforms sparse human feedback into reusable domain knowledge, and knowledge-aligned progressive summarization, which stabilizes long interactions through memory compression. Extensive experiments on WebArena, WebChoreArena, and an industrial deployment show that ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under a zero-shot transfer setting on WebChoreArena. In commercial deployment, it yields a 19.3% relative improvement in user satisfaction, verifying its robustness in real-world scenarios.
2025
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Assistant Scenarios
Jun Wang | Jiamu Zhou | Xihuai Wang | Xiaoyun Mo | Haoyu Zhang | Qiqiang Lin | Cheng Jin | Muning Wen | Weinan Zhang | Qiuying Peng | Jun Wang
Findings of the Association for Computational Linguistics: ACL 2025
Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs’ function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.