Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling

Benjamin Elder, Anupama Murthi, Jungkoo Kang, Ankita Naik, Kinjal Basu, Kiran Kate, Danish Contractor


Abstract
Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API-Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD-SQL into executable API sequences across three formulations (SLOT, SEL, and REST), covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live API-Bench enables systematic evaluation of core challenges in tool use, including error handling, sequential reasoning, parameter generation, response parsing, and robustness across diverse domains. We evaluate 10 LLMs and 4 ReAct agents, observing low task completion rates (7–47%), which improve modestly to 50% under interactive agent settings, highlighting substantial scope for improving LLM tool-calling performance. We release all code and data associated with this paper.
Anthology ID:
2026.eacl-long.143
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
3092–3124
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.143/
Cite (ACL):
Benjamin Elder, Anupama Murthi, Jungkoo Kang, Ankita Naik, Kinjal Basu, Kiran Kate, and Danish Contractor. 2026. Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3092–3124, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling (Elder et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.143.pdf