Yufan Shen
2026
Towards Self-Evolving Agents: Enabling Autonomy through Interactive Experience Refinement
Cheng Yang | Xuemeng Yang | Licheng Wen | Daocheng Fu | Jianbiao Mei | Rong Wu | Pinlong Cai | Yufan Shen | Nianchen Deng | Jia Xu | Botian Shi | Yu Qiao | Haifeng Li
Findings of the Association for Computational Linguistics: ACL 2026
Cheng Yang | Xuemeng Yang | Licheng Wen | Daocheng Fu | Jianbiao Mei | Rong Wu | Pinlong Cai | Yufan Shen | Nianchen Deng | Jia Xu | Botian Shi | Yu Qiao | Haifeng Li
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models often struggle with complex, multi-step operational tasks because they remain static during inference and cannot learn from past experience. To address this, we propose MUSE, a framework that enables iterative self-improvement through a hierarchical Memory Module. MUSE organizes cross-domain insights to facilitate the orchestration of long-horizon workflows. The core of our approach is an autonomous post-execution critique mechanism: after completing each sub-task, the system analyzes its operational logs and distills raw execution data into structured, reusable knowledge. This allows the agent to evolve dynamically rather than relying on fixed parameters. Evaluated on the rigorous TAC productivity benchmark, MUSE achieves new state-of-the-art results, significantly outperforming previous methods using only the streamlined Gemini-2.5 Flash model. Our analysis demonstrates that MUSE’s performance scales with the accumulation of insights and exhibits strong cross-task transferability, marking a key step toward autonomous systems capable of lifelong learning in professional environments. Demo videos can be found in our supplementary materials.
2025
RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios
Fei Zhao | Chengqiang Lu | Yufan Shen | Qimeng Wang | Yicheng Qian | Haoxin Zhang | Yan Gao | Yiwu | Yao Hu | Zhen Wu | Shangyu Xing | Xinyu Dai
Findings of the Association for Computational Linguistics: EMNLP 2025
Fei Zhao | Chengqiang Lu | Yufan Shen | Qimeng Wang | Yicheng Qian | Haoxin Zhang | Yan Gao | Yiwu | Yao Hu | Zhen Wu | Shangyu Xing | Xinyu Dai
Findings of the Association for Computational Linguistics: EMNLP 2025
While various multimodal multi-image evaluation datasets have been emerged, but these datasets are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Ultimately, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context. Our datasets will be publicly available.