2025
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Jiarui Lu | Thomas Holleis | Yizhe Zhang | Bernhard Aumayer | Feng Nan | Haoping Bai | Shuang Ma | Shen Ma | Mengyu Li | Guoli Yin | Zirui Wang | Ruoming Pang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating either over stateless web services (RESTful APIs) based on a single-turn user prompt, or over an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing new insights into tool-use LLM capabilities. Datasets and evaluation scripts of ToolSandbox are released at <placeholder>.
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Guoli Yin | Haoping Bai | Shuang Ma | Feng Nan | Yanchao Sun | Zhaoyang Xu | Shen Ma | Jiarui Lu | Xiang Kong | Aonan Zhang | Dian Ang Yap | Yizhe Zhang | Karsten Ahnert | Vik Kamath | Mathias Berglund | Dominic Walsh | Tobias Gindele | Juergen Wiest | Zhengfeng Lai | Xiaoming Simon Wang | Jiulong Shan | Meng Cao | Ruoming Pang | Zirui Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains (Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics) and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 20 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance.
2024
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
Aiwei Liu | Haoping Bai | Zhiyun Lu | Xiang Kong | Xiaoming Wang | Jiulong Shan | Meng Cao | Lijie Wen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate response preference using the output probabilities of response pairs under contrastive prompt pairs, which achieves better performance on LLaMA2-7B and LLaMA2-13B than RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we evaluate the generated preference data, again using contrastive prompt pairs, and calculate a self-rewarding score. Finally, we use the DPO algorithm to align LLMs effectively by incorporating this self-rewarding score. In our experiments, DLMA surpasses the RLHF method without relying on human-annotated preference data.
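The three steps in the DLMA abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the score is taken to be a difference of log-probability contrasts between a positive and a negative prompt, and it enters a DPO-style loss as a margin. The names `contrast`, `self_rewarding_score`, `order_pair`, `dpo_loss_with_margin`, and `toy_logprob`, as well as the prompt strings and the toy log-probabilities, are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Tuple

# Hypothetical interface: (full_prompt, response) -> log p(response | full_prompt).
LogProbFn = Callable[[str, str], float]


def contrast(logprob: LogProbFn, query: str, response: str,
             pos_prompt: str, neg_prompt: str) -> float:
    """Log-probability gain of a response under the positive vs. negative prompt."""
    return (logprob(pos_prompt + query, response)
            - logprob(neg_prompt + query, response))


def self_rewarding_score(logprob: LogProbFn, query: str, resp_a: str, resp_b: str,
                         pos_prompt: str, neg_prompt: str) -> float:
    """Assumed score: positive when resp_a is preferred over resp_b."""
    return (contrast(logprob, query, resp_a, pos_prompt, neg_prompt)
            - contrast(logprob, query, resp_b, pos_prompt, neg_prompt))


def order_pair(resp_a: str, resp_b: str, score: float) -> Tuple[str, str, float]:
    """Return (chosen, rejected, non-negative score)."""
    return (resp_a, resp_b, score) if score >= 0 else (resp_b, resp_a, -score)


def dpo_loss_with_margin(policy_chosen: float, policy_rejected: float,
                         ref_chosen: float, ref_rejected: float,
                         margin: float = 0.0, beta: float = 0.1) -> float:
    """DPO loss on sequence log-probs, with the score used as a margin (assumption)."""
    logit = beta * ((policy_chosen - ref_chosen)
                    - (policy_rejected - ref_rejected)) - margin
    # -log(sigmoid(logit)), computed stably as softplus(-logit).
    return math.log1p(math.exp(-abs(logit))) + max(-logit, 0.0)


# Toy log-probabilities standing in for a real model (made-up values).
_TOY = {
    ("[helpful] Q", "A"): -1.0,
    ("[harmful] Q", "A"): -3.0,
    ("[helpful] Q", "B"): -2.5,
    ("[harmful] Q", "B"): -2.0,
}


def toy_logprob(prompt: str, response: str) -> float:
    return _TOY[(prompt, response)]


score = self_rewarding_score(toy_logprob, "Q", "A", "B", "[helpful] ", "[harmful] ")
chosen, rejected, margin = order_pair("A", "B", score)
```

In a real setup, `toy_logprob` would be replaced by the summed token log-probabilities of the model being aligned, and the margin would typically be scaled or clipped before use; those choices are left open here.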