ACEBench: A Comprehensive Evaluation of LLM Tool Usage
Chen Chen | Xinlong Hao | Weiwen Liu | Xu Huang | Xingshan Zeng | Shuai Yu | Dexun Li | Yuefeng Huang | Xiangcheng Liu | Wang Xinzhi | Wu Liu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. “Normal” evaluates tool usage in basic scenarios; “Special” evaluates tool usage in situations with ambiguous or incomplete instructions; “Agent” evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in depth and providing a more granular examination of error causes across different data types.
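The abstract emphasizes that ACEBench avoids relying on LLM judges or real API executions for evaluation. The following is a minimal, illustrative sketch (not the paper's actual schema) of how the three data categories could be represented and how a rule-based checker might score a predicted tool call against a gold call; the names `Category`, `ToolCall`, `matches`, and `get_weather` are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Category(Enum):
    """The three ACEBench data types named in the abstract."""
    NORMAL = "normal"    # basic tool-usage scenarios
    SPECIAL = "special"  # ambiguous or incomplete instructions
    AGENT = "agent"      # multi-agent, multi-turn dialogue simulation


@dataclass
class ToolCall:
    """A single tool invocation: function name plus keyword arguments."""
    name: str
    arguments: dict[str, Any]


def matches(predicted: ToolCall, gold: ToolCall) -> bool:
    """Rule-based check of a predicted call against a gold call,
    requiring no LLM judge and no live API execution."""
    return predicted.name == gold.name and predicted.arguments == gold.arguments


# Hypothetical usage: score one "Normal" example.
gold = ToolCall("get_weather", {"city": "Beijing", "unit": "celsius"})
pred = ToolCall("get_weather", {"city": "Beijing", "unit": "celsius"})
print(Category.NORMAL.value, "correct" if matches(pred, gold) else "wrong")
```

Exact matching of the call name and arguments keeps scoring deterministic and cheap, which is the kind of overhead reduction the abstract contrasts with LLM-based or execution-based evaluation.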