Thomas Holleis
2025
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Jiarui Lu | Thomas Holleis | Yizhe Zhang | Bernhard Aumayer | Feng Nan | Haoping Bai | Shuang Ma | Shen Ma | Mengyu Li | Guoli Yin | Zirui Wang | Ruoming Pang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single-turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that there is a significant performance gap between open-source and proprietary models, and that complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. Datasets and evaluation scripts of ToolSandbox are released at <placeholder>.