2025
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data
Juntao Tan | Liangwei Yang | Zuxin Liu | Zhiwei Liu | Rithesh R N | Tulika Manoj Awalgaonkar | Jianguo Zhang | Weiran Yao | Ming Zhu | Shirley Kokane | Silvio Savarese | Huan Wang | Caiming Xiong | Shelby Heinecke
Findings of the Association for Computational Linguistics: ACL 2025
Personalization is essential for AI assistants, especially in private AI settings where models are expected to interpret users’ personal data (e.g., conversations, app usage) to understand their background, preferences, and social context. However, due to privacy concerns, existing academic research lacks direct access to such data, making benchmarking difficult. To fill this gap, we propose a synthetic data pipeline that generates realistic user profiles and private documents, enabling the creation of PersonaBench, a benchmark for evaluating models’ ability to understand personal information. Using this benchmark, we assess Retrieval-Augmented Generation (RAG) pipelines on personalized questions and find that current models struggle to accurately extract personal information and answer questions about it, even when provided with the full set of user documents, highlighting the need for improved personalization methods.
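To make the evaluation setup concrete, here is a minimal sketch of how a RAG pipeline might be scored on personalized questions over a user's private documents. The toy lexical retriever, the `llm_answer` callable, and the exact-match metric are illustrative assumptions, not PersonaBench's actual interface or scoring.

```python
# Minimal sketch of RAG evaluation on personalized QA, under assumed data shapes.
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    gold_answer: str

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """Toy lexical retriever: rank user documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def evaluate(items: list[QAItem], documents: list[str], llm_answer) -> float:
    """Exact-match accuracy of answers grounded on retrieved private documents.
    `llm_answer` is a hypothetical prompt-in/answer-out callable."""
    correct = 0
    for item in items:
        context = "\n".join(retrieve(item.question, documents))
        prediction = llm_answer(f"Context:\n{context}\n\nQuestion: {item.question}")
        correct += prediction.strip().lower() == item.gold_answer.strip().lower()
    return correct / len(items)
```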
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback
Thai Quoc Hoang | Kung-Hsiang Huang | Shirley Kokane | Jianguo Zhang | Zuxin Liu | Ming Zhu | Jake Grigsby | Tian Lan | Michael S Ryoo | Chien-Sheng Wu | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong | Juan Carlos Niebles
Findings of the Association for Computational Linguistics: ACL 2025
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, underscoring its efficiency and effectiveness in speeding up the development of AI agents.
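The core loop the abstract describes (an agent exploring tasks, receiving real-time feedback, and keeping successful trajectories as training data) might look roughly like this. The `agent` and `env` objects and their methods are hypothetical placeholders, not LAM SIMULATOR's real API.

```python
# Sketch of an explore-and-collect loop for trajectory data, under assumed
# agent/environment interfaces.
def collect_trajectories(agent, env, task_queries, max_steps=10):
    dataset = []
    for task in task_queries:
        observation, trajectory = env.reset(task), []
        for _ in range(max_steps):
            action = agent.act(task, observation, trajectory)  # e.g. a tool call
            observation, done, feedback = env.step(action)     # real-time feedback
            trajectory.append({"action": action, "feedback": feedback})
            if done:
                break
        if env.success():  # keep only successful runs as LAM training data
            dataset.append({"task": task, "trajectory": trajectory})
    return dataset
```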
xLAM: A Family of Large Action Models to Empower AI Agent Systems
Jianguo Zhang | Tian Lan | Ming Zhu | Zuxin Liu | Thai Quoc Hoang | Shirley Kokane | Weiran Yao | Juntao Tan | Akshara Prabhakar | Haolin Chen | Zhiwei Liu | Yihao Feng | Tulika Manoj Awalgaonkar | Rithesh R N | Zeyuan Chen | Ran Xu | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-experts architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents’ generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks.
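A minimal sketch of the "unify diverse datasets" step: records from differently-shaped agent datasets are mapped into one shared schema before augmentation and synthesis. The source names and field names here are illustrative assumptions, not xLAM's actual data format.

```python
# Hypothetical unification of heterogeneous agent-task records into one schema.
def unify_record(record: dict, source: str) -> dict:
    if source == "function_calling":
        # Single-turn tool-use record: one query, one tool call.
        turns = [{"role": "user", "content": record["query"]},
                 {"role": "assistant", "content": record["call"]}]
    elif source == "multi_turn_agent":
        # Already a turn list; keep as-is.
        turns = record["conversation"]
    else:
        raise ValueError(f"unknown source: {source}")
    return {"source": source, "turns": turns, "tools": record.get("tools", [])}
```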
2024
PRACT: Optimizing Principled Reasoning and Acting of LLM Agent
Zhiwei Liu | Weiran Yao | Jianguo Zhang | Zuxin Liu | Liangwei Yang | Rithesh R N | Tian Lan | Ming Zhu | Juntao Tan | Shirley Kokane | Thai Quoc Hoang | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong
Proceedings of the 28th Conference on Computational Natural Language Learning
We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We investigate the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, we developed two RPO methods, RPO-Traj and RPO-Batch, to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, can effectively learn and apply action principles to enhance performance.
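As a rough illustration, one RPO iteration (reflect, then optimize) could be sketched as follows. `reflector_llm` and `optimizer_llm` are hypothetical text-in/text-out callables, and the prompts are placeholders rather than the paper's actual templates.

```python
# Sketch of a single Reflective Principle Optimization step, under assumed
# LLM interfaces: critique the principles, then rewrite them.
def rpo_step(principles: str, trajectory: str, reward: float | None,
             reflector_llm, optimizer_llm) -> str:
    # Reward-RPO reflects with the environment reward; Self-RPO without it.
    if reward is not None:
        reward_note = f"Environment reward: {reward}"
    else:
        reward_note = "No external reward; self-reflect on the trajectory alone."
    critique = reflector_llm(
        f"Principles:\n{principles}\n\nTrajectory:\n{trajectory}\n\n{reward_note}\n"
        "Critique how well these principles guided the agent."
    )
    # The optimizer applies the textual 'gradient' (critique) to the principles.
    return optimizer_llm(
        "Rewrite the action principles to address this critique.\n"
        f"Principles:\n{principles}\n\nCritique:\n{critique}"
    )
```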
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI
Jianguo Zhang | Kun Qian | Zhiwei Liu | Shelby Heinecke | Rui Meng | Ye Liu | Zhou Yu | Huan Wang | Silvio Savarese | Caiming Xiong
Findings of the Association for Computational Linguistics: EACL 2024
Despite advancements in conversational AI, language models encounter challenges in handling diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it an incredibly rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the licenses for each dataset and design external knowledge and domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. To improve transparency and support dataset and task-based research, as well as language model pre-training, all datasets, licenses, codes, and models associated with DialogStudio will be made publicly accessible.
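To illustrate what a consistent unified format enables, here is a hypothetical sketch of one unified dialogue entry and a helper that flattens it into instruction-tuning pairs. The field names and example values are assumptions, not DialogStudio's exact schema.

```python
# Hypothetical unified-format entry: source identity, license, and original
# content preserved alongside knowledge and prompt fields.
unified_example = {
    "dataset": "MULTIWOZ2_2",            # original source, preserved
    "license": "MIT",                    # per-dataset license identified
    "dialog": [
        {"turn": 1,
         "user": "I need a cheap hotel in the centre.",
         "system": "Alexander B&B is a cheap guesthouse in the centre."},
    ],
    "external_knowledge": "Alexander B&B: guesthouse, centre, cheap, 4 stars",
    "prompt": "You are a helpful task-oriented assistant for hotel booking.",
}

def to_instruction_pairs(example: dict) -> list[tuple[str, str]]:
    """Flatten a unified dialogue into (instruction, response) training pairs."""
    pairs, history = [], example["prompt"]
    for turn in example["dialog"]:
        history += f"\nUser: {turn['user']}"
        instruction = f"{history}\nKnowledge: {example['external_knowledge']}"
        pairs.append((instruction, turn["system"]))
        history += f"\nSystem: {turn['system']}"
    return pairs
```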
2023
Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System
Jianguo Zhang | Stephen Roller | Kun Qian | Zhiwei Liu | Rui Meng | Shelby Heinecke | Huan Wang | Silvio Savarese | Caiming Xiong
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
End-to-end task-oriented dialogue (TOD) systems have achieved promising performance by leveraging sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work gives TOD systems more flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems and handle both existing and unseen dialogue scenarios. Towards this end, we first fine-tune a retrieval module to effectively retrieve the most relevant information entries from the cache. We then train end-to-end TOD models that can refer to and ground on both dialogue history and retrieved information during TOD generation. The introduced cache is straightforward to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our framework, with a notable improvement in non-empty joint goal accuracy by 6.7% compared to strong baselines.
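The cache mechanism might be sketched as follows: retrieve the most relevant cache entries for the current turn, then condition generation on both the dialogue history and the retrieved entries. The toy overlap retriever and the `generate` callable are stand-ins for the paper's fine-tuned retrieval module and TOD model, not its actual implementation.

```python
# Sketch of cache-augmented TOD generation, under assumed interfaces.
def retrieve_from_cache(query: str, cache: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank cache entries by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(cache,
                  key=lambda e: len(q & set(e.lower().split())),
                  reverse=True)[:k]

def respond(history: list[str], cache: list[str], generate) -> str:
    """Ground the response on dialogue history plus retrieved cache entries."""
    query = history[-1]  # latest user utterance
    entries = retrieve_from_cache(query, cache)
    prompt = ("Dialogue history:\n" + "\n".join(history) +
              "\nRetrieved entries:\n" + "\n".join(entries) + "\nSystem:")
    return generate(prompt)

# The cache can be updated at any time to cover unseen scenarios, e.g.:
# cache.append("restaurant: Curry Prince | area: east | price: moderate")
```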