Shijie Xia
2026
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Keyu Li | Junhao Shi | Yang Xiao | Mohan Jiang | Jie Sun | Yunze Wu | Dayuan Fu | Shijie Xia | Xiaojie Cai | Tianze Xu | Weiye Si | Wenjie Li | Dequan Wang | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences.
2024
SAFETY-J: Evaluating Safety with Critique
Yixiu Liu | Yuxiang Zheng | Shijie Xia | Jiajun Li | Yi Tu | Chaoling Song | Pengfei Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety comprehensively across various scenarios. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we have released SAFETY-J’s training protocols, datasets, and code at https://github.com/GAIR-NLP/Safety-J.