Yang Xiao
2026
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Keyu Li | Junhao Shi | Yang Xiao | Mohan Jiang | Jie Sun | Yunze Wu | Dayuan Fu | Shijie Xia | Xiaojie Cai | Tianze Xu | Weiye Si | Wenjie Li | Dequan Wang | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability and fail to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences.
SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
Jie Sun | Yu Liu | Lu Han | Qiwen Deng | Xiang Shu | Yang Xiao | Lintao Ma | Xingyu Lu | Jun Zhou | Pengfei Liu | Jiancan Wu | Xiang Wang
Findings of the Association for Computational Linguistics: ACL 2026
While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to attention dispersion in the softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework that mitigates dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as attention anchors, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on nine widely adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing inference token consumption by 16.4%.
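As a rough illustration of the separator-insertion idea described in the abstract above, the sketch below serializes a numerical sequence into prompt text with a separator placed between fixed-size segments. The abstract does not specify the separator token, the segment length, or the placement strategy, so `insert_separators`, `segment_len`, and `sep` are hypothetical names and values chosen only for illustration; this is not the authors' implementation.

```python
from typing import Sequence


def insert_separators(
    values: Sequence[float],
    segment_len: int = 4,   # assumed segment size; not taken from the paper
    sep: str = " | ",       # assumed separator string; not taken from the paper
) -> str:
    """Serialize numbers as text, placing `sep` between fixed-size segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Split the sequence into consecutive chunks of at most `segment_len` items.
    segments = [
        ", ".join(str(v) for v in values[i:i + segment_len])
        for i in range(0, len(values), segment_len)
    ]
    # Join the chunks with the separator so each chunk forms a local segment.
    return sep.join(segments)


if __name__ == "__main__":
    print(insert_separators([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], segment_len=4))
```

The resulting string would be placed in the model prompt in place of the unsegmented sequence; no model call is shown here.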
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Yukang Feng | Jianwen Sun | Zelai Yang | Jiaxin Ai | Chuanhao Li | Zizhen Li | Fanrui Zhang | Kang He | Rui Ma | Jifan Lin | Jie Sun | Yang Xiao | Sizhuo Zhou | Wenxiao Wu | Yiming Liu | Pengfei Liu | Shenglin Zhang | Kaipeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic, sequential engineering tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. LongCLI-Bench employs a dual-set testing protocol, which measures requirement fulfillment (fail→pass) and regression avoidance (pass→pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents’ planning and execution capabilities to overcome key challenges in long-horizon task performance.
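The dual-set protocol mentioned in the abstract above can be sketched as two pass rates computed from test outcomes before and after an agent's changes. This is a minimal sketch under stated assumptions: the abstract does not say how the two test sets are constructed or scored, so the function `dual_set_scores`, its inputs, and the convention for empty sets are hypothetical and not the benchmark's actual harness.

```python
from typing import Dict, List


def dual_set_scores(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, float]:
    """Compute fail->pass and pass->pass rates from per-test outcomes.

    `before` and `after` map a test name to True (pass) or False (fail).
    A test missing from `after` is counted as failing (an assumption).
    """
    # Tests that failed before the change: these measure requirement fulfillment.
    fail_to_pass: List[str] = [t for t, ok in before.items() if not ok]
    # Tests that passed before the change: these measure regression avoidance.
    pass_to_pass: List[str] = [t for t, ok in before.items() if ok]

    def rate(tests: List[str]) -> float:
        if not tests:
            return 1.0  # assumed convention: an empty set is vacuously satisfied
        return sum(1 for t in tests if after.get(t, False)) / len(tests)

    return {"fail_to_pass": rate(fail_to_pass), "pass_to_pass": rate(pass_to_pass)}


if __name__ == "__main__":
    before = {"new_feature_a": False, "new_feature_b": False, "legacy_c": True, "legacy_d": True}
    after = {"new_feature_a": True, "new_feature_b": False, "legacy_c": True, "legacy_d": True}
    print(dual_set_scores(before, after))
```

Reporting the two rates separately keeps a change that satisfies new requirements but breaks existing behavior from looking like a success.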
2025
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
Yang Xiao | Jiashuo Wang | Qiancheng Xu | Changhe Song | Chunpu Xu | Yi Cheng | Wenjie Li | Pengfei Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities, particularly their ability to track dynamic mental states, becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs’ ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that, on average, they underperform humans by 44.7%, with performance degrading significantly when tracking and reasoning about shifts in mental states. This performance gap highlights fundamental limitations in current LLMs’ ability to model the dynamic nature of human mental states.
Co-authors
- Pengfei Liu 4
- Jie Sun 3
- Wenjie Li 2
- Jiaxin Ai 1
- Xiaojie Cai 1
- Yi Cheng 1
- Qiwen Deng 1
- Yukang Feng 1
- Dayuan Fu 1
- Lu Han 1
- Kang He 1
- Mohan Jiang 1
- Keyu Li 1
- Chuanhao Li 1
- Zizhen Li 1
- Jifan Lin 1
- Yu Liu 1
- Yiming Liu 1
- Xingyu Lu 1
- Lintao Ma 1
- Rui Ma 1
- Junhao Shi 1
- Xiang Shu 1
- Weiye Si 1
- Changhe Song 1
- Jianwen Sun 1
- Dequan Wang 1
- Jiashuo Wang 1
- Xiang Wang 1
- Yunze Wu 1
- Jiancan Wu 1
- Wenxiao Wu 1
- Shijie Xia 1
- Tianze Xu 1
- Qiancheng Xu 1
- Chunpu Xu 1
- Zelai Yang 1
- Fanrui Zhang 1
- Shenglin Zhang 1
- Kaipeng Zhang 1
- Jun Zhou 1
- Sizhuo Zhou 1