2025
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Jianguo Zhang | Thai Quoc Hoang | Ming Zhu | Zuxin Liu | Shiyu Wang | Tulika Manoj Awalgaonkar | Akshara Prabhakar | Haolin Chen | Weiran Yao | Zhiwei Liu | Juntao Tan | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Action Models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with an optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9× higher throughput than existing agentic training frameworks, and our trained models achieve top performance across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories.
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Zhiwei Liu | Jielin Qiu | Shiyu Wang | Jianguo Zhang | Zuxin Liu | Roshan Ram | Haolin Chen | Weiran Yao | Shelby Heinecke | Silvio Savarese | Huan Wang | Caiming Xiong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The rapid adoption of Large Language Models (LLMs) as intelligent agents has underscored the necessity for robust evaluation frameworks capable of assessing agent performance in realistic, interactive environments. Existing evaluation methodologies often suffer from limitations such as static task benchmarks, limited scope, and inadequate integration with practical applications. In response, we introduce MCPEval, an open-source, Model Context Protocol (MCP)-based evaluation framework specifically tailored for comprehensive and systematic assessment of LLM-powered agents. MCPEval standardizes evaluations across diverse domains through automated task generation and verification, supports multiple performance metrics, and integrates seamlessly with native agent capabilities. We empirically validate the effectiveness of MCPEval across five distinct real-world domains, highlighting significant variations in performance across various LLM architectures and prompting strategies. Our results illustrate the framework’s capacity to uncover nuanced performance patterns and identify domain-specific strengths and weaknesses, providing valuable insights beyond traditional binary success metrics. We publicly release MCPEval to foster reproducible research and promote standardized evaluation practices within the LLM agent community.
SlackAgents: Scalable Collaboration of AI Agents in Workspaces
Zhiwei Liu | Weiran Yao | Zuxin Liu | Juntao Tan | Jianguo Zhang | Frank Wang | Sukhandeep Nahal | Huan Wang | Shelby Heinecke | Silvio Savarese | Caiming Xiong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
In today’s rapidly evolving business landscape, organizations are turning to AI agents to automate tasks, streamline business operations, and improve decision-making processes. However, despite the flexibility offered by existing libraries, the developed agents often struggle with integration into organizational workflows, resulting in limited day-to-day use at work. In this paper, we present SlackAgents, a multi-agent library for scalable management and collaboration of AI agents on Slack. As an agentic layer developed upon the Slack platform, the framework offers instant AI integration into organizational workflows and enables AI-powered automation of real daily tasks. Furthermore, SlackAgents facilitates scalable collaboration, allowing for effective communication and task orchestration. Our solution bridges existing gaps, offering a robust platform for developing, deploying, and managing AI agents in workplace environments.
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data
Juntao Tan | Liangwei Yang | Zuxin Liu | Zhiwei Liu | Rithesh R N | Tulika Manoj Awalgaonkar | Jianguo Zhang | Weiran Yao | Ming Zhu | Shirley Kokane | Silvio Savarese | Huan Wang | Caiming Xiong | Shelby Heinecke
Findings of the Association for Computational Linguistics: ACL 2025
Personalization is essential for AI assistants, especially in private AI settings where models are expected to interpret users’ personal data (e.g., conversations, app usage) to understand their background, preferences, and social context. However, due to privacy concerns, existing academic research lacks direct access to such data, making benchmarking difficult. To fill this gap, we propose a synthetic data pipeline that generates realistic user profiles and private documents, enabling the creation of PersonaBench—a benchmark for evaluating models’ ability to understand personal information. Using this benchmark, we assess Retrieval-Augmented Generation (RAG) pipelines on personalized questions and find that current models struggle to accurately extract and answer questions even when provided with the full set of user documents, highlighting the need for improved personalization methods.
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback
Thai Quoc Hoang | Kung-Hsiang Huang | Shirley Kokane | Jianguo Zhang | Zuxin Liu | Ming Zhu | Jake Grigsby | Tian Lan | Michael S Ryoo | Chien-Sheng Wu | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong | Juan Carlos Niebles
Findings of the Association for Computational Linguistics: ACL 2025
Large Action Models (LAMs) for AI Agents offer great potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, demonstrating its efficiency and effectiveness in speeding up the development of AI agents.
Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training
Yihang Yao | Zhepeng Cen | Miao Li | William Han | Yuyou Zhang | Emerson Liu | Zuxin Liu | Chuang Gan | Ding Zhao
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated strong reasoning capabilities across various tasks. However, even minor variations in query phrasing, despite preserving the underlying semantic meaning, can significantly affect their performance. To address this, we focus on enhancing LLMs’ awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) data augmentation, a data-centric approach that improves the model’s ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentation, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insights into improving LLM robustness through structured dataset curation.
xLAM: A Family of Large Action Models to Empower AI Agent Systems
Jianguo Zhang | Tian Lan | Ming Zhu | Zuxin Liu | Thai Quoc Hoang | Shirley Kokane | Weiran Yao | Juntao Tan | Akshara Prabhakar | Haolin Chen | Zhiwei Liu | Yihao Feng | Tulika Manoj Awalgaonkar | Rithesh R N | Zeyuan Chen | Ran Xu | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents’ generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks.
2024
PRACT: Optimizing Principled Reasoning and Acting of LLM Agent
Zhiwei Liu | Weiran Yao | Jianguo Zhang | Zuxin Liu | Liangwei Yang | Rithesh R N | Tian Lan | Ming Zhu | Juntao Tan | Shirley Kokane | Thai Quoc Hoang | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong
Proceedings of the 28th Conference on Computational Natural Language Learning
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We investigate the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, we develop two RPO methods, RPO-Traj and RPO-Batch, to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, can effectively learn and apply action principles to enhance performance.