ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data

Gregory Polyakov; Ilseyar Alimova; Dmitry Abulkhanov; Ivan Sedykh; Andrey Bout; Sergey Nikolenko; Irina Piontkovskaya

Abstract

While open-source large language models (LLMs) have advanced in leveraging third-party tools, significant challenges remain in real-world API usage, where behavior is unpredictable or poorly specified. Existing benchmarks often fail to capture this complexity. We propose ToolReflection, a novel method that improves LLMs’ ability to self-correct API calls by utilizing real-time API feedback. We also introduce new datasets specifically designed to test model performance under realistic conditions. In ToolReflection, models undergo instruction tuning on a dataset augmented with self-generated errors and corrections. Our evaluation across the ToolAlpaca and ToolBench benchmarks and three newly developed datasets (GPT4Tools-OOD, GPT4Tools-OOD-Hard, and Multistep-100) demonstrates its effectiveness. ToolReflection boosts overall success rates by 25.4% on GPT4Tools-OOD, 56.2% on GPT4Tools-OOD-Hard, and 4% on Multistep-100, outperforming the original models. On ToolAlpaca, we show a 14% improvement in the “Simulated” setting and 10.5% in the “Real-world” setting. Our error analysis highlights that ToolReflection significantly enhances recovery from incorrect tool calls, even with incomplete or erroneous API documentation. We have released the code, prompts, and data at https://github.com/polgrisha/ToolReflection.

Anthology ID: 2025.realm-1.14
Volume: Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Ehsan Kamalloo, Nicolas Gontier, Xing Han Lu, Nouha Dziri, Shikhar Murty, Alexandre Lacoste
Venues: REALM | WS
Publisher: Association for Computational Linguistics
Pages: 184–199
URL: https://preview.aclanthology.org/display_plenaries/2025.realm-1.14/
Cite (ACL): Gregory Polyakov, Ilseyar Alimova, Dmitry Abulkhanov, Ivan Sedykh, Andrey Bout, Sergey Nikolenko, and Irina Piontkovskaya. 2025. ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data. In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 184–199, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data (Polyakov et al., REALM 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.realm-1.14.pdf
