Rong Wu
2026
The Agent’s First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
Daocheng Fu | Jianbiao Mei | Rong Wu | Xuemeng Yang | Jia Xu | Ding Wang | Pinlong Cai | Yong Liu | Licheng Wen | Botian Shi
Findings of the Association for Computational Linguistics: ACL 2026
The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce TraineeBench, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, TraineeBench evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios.
Towards Self-Evolving Agents: Enabling Autonomy through Interactive Experience Refinement
Cheng Yang | Xuemeng Yang | Licheng Wen | Daocheng Fu | Jianbiao Mei | Rong Wu | Pinlong Cai | Yufan Shen | Nianchen Deng | Jia Xu | Botian Shi | Yu Qiao | Haifeng Li
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models often struggle with complex, multi-step operational tasks because they remain static during inference and cannot learn from past experience. To address this, we propose MUSE, a framework that enables iterative self-improvement through a hierarchical Memory Module. MUSE organizes cross-domain insights to facilitate the orchestration of long-horizon workflows. The core of our approach is an autonomous post-execution critique mechanism: after completing each sub-task, the system analyzes its operational logs and distills raw execution data into structured, reusable knowledge. This allows the agent to evolve dynamically rather than relying on fixed parameters. Evaluated on the rigorous TAC productivity benchmark, MUSE achieves new state-of-the-art results, significantly outperforming previous methods using only the streamlined Gemini-2.5 Flash model. Our analysis demonstrates that MUSE’s performance scales with the accumulation of insights and exhibits strong cross-task transferability, marking a key step toward autonomous systems capable of lifelong learning in professional environments. Demo videos can be found in our supplementary materials.