StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu


Abstract
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.
Anthology ID:
2024.findings-acl.664
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
11143–11156
Language:
URL:
https://aclanthology.org/2024.findings-acl.664
DOI:
10.18653/v1/2024.findings-acl.664
Bibkey:
Cite (ACL):
Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. 2024. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. In Findings of the Association for Computational Linguistics ACL 2024, pages 11143–11156, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models (Guo et al., Findings 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2024.findings-acl.664.pdf