Yang Xiao
Other people with similar names: Yang Xiao, Yang Xiao
Unverified author pages with similar names: Yang Xiao
2026
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Keyu Li | Junhao Shi | Yang Xiao | Mohan Jiang | Jie Sun | Yunze Wu | Dayuan Fu | Shijie Xia | Xiaojie Cai | Tianze Xu | Weiye Si | Wenjie Li | Dequan Wang | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Keyu Li | Junhao Shi | Yang Xiao | Mohan Jiang | Jie Sun | Yunze Wu | Dayuan Fu | Shijie Xia | Xiaojie Cai | Tianze Xu | Weiye Si | Wenjie Li | Dequan Wang | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences.
2025
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
Yang Xiao | Jiashuo Wang | Qiancheng Xu | Changhe Song | Chunpu Xu | Yi Cheng | Wenjie Li | Pengfei Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yang Xiao | Jiashuo Wang | Qiancheng Xu | Changhe Song | Chunpu Xu | Yi Cheng | Wenjie Li | Pengfei Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present **DynToM**, a novel benchmark specifically designed to evaluate LLMs’ ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs’ ability to model the dynamic nature of human mental states.