ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data
Gregory Polyakov | Ilseyar Alimova | Dmitry Abulkhanov | Ivan Sedykh | Andrey Bout | Sergey Nikolenko | Irina Piontkovskaya
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
While open-source large language models (LLMs) have advanced in leveraging third-party tools, significant challenges remain in real-world API usage, where behavior is unpredictable or poorly specified. Existing benchmarks often fail to capture this complexity. We propose ToolReflection, a novel method that improves LLMs’ ability to self-correct API calls by utilizing real-time API feedback. We also introduce new datasets specifically designed to test model performance under realistic conditions. In ToolReflection, models undergo instruction tuning on a dataset augmented with self-generated errors and corrections. Our evaluation across the ToolAlpaca and ToolBench benchmarks and three newly developed datasets (GPT4Tools-OOD, GPT4Tools-OOD-Hard, and Multistep-100) demonstrates its effectiveness. ToolReflection boosts overall success rates by 25.4% on GPT4Tools-OOD, 56.2% on GPT4Tools-OOD-Hard, and 4% on Multistep-100, outperforming the original models. On ToolAlpaca, we show a 14% improvement in the “Simulated” setting and 10.5% in the “Real-world” scenario. Our error analysis highlights that ToolReflection significantly enhances recovery from incorrect tool calls, even with incomplete or erroneous API documentation. We have released the code, prompts, and data at https://github.com/polgrisha/ToolReflection.
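As a rough illustration of the inference-time behavior the abstract describes, a self-correction loop driven by real API feedback might look like the sketch below. This is not the authors’ released implementation; every name here (call_llm, execute_api_call, MAX_RETRIES) is a hypothetical placeholder, and the retry budget is an assumption.

# Minimal sketch of a tool-call self-correction loop, assuming hypothetical
# helpers call_llm() and execute_api_call(); not the released implementation.

MAX_RETRIES = 3  # assumed retry budget, not specified in the abstract


def call_llm(prompt: str) -> str:
    """Placeholder: query the instruction-tuned LLM for an API call."""
    raise NotImplementedError


def execute_api_call(call: str) -> tuple[bool, str]:
    """Placeholder: execute the call; return (success, result_or_error)."""
    raise NotImplementedError


def solve_with_reflection(task: str) -> str | None:
    """Generate an API call, then revise it using real API feedback."""
    prompt = f"Task: {task}\nGenerate an API call."
    for _ in range(MAX_RETRIES):
        call = call_llm(prompt)
        ok, feedback = execute_api_call(call)
        if ok:
            return feedback  # the API returned a successful result
        # Append the real error so the model can self-correct, mirroring
        # the self-generated error-and-correction pairs used in training.
        prompt += (
            f"\nPrevious call: {call}"
            f"\nAPI error: {feedback}"
            f"\nRevise the API call."
        )
    return None  # could not recover within the retry budget

The key design point, per the abstract, is that the raw API error message is fed back into the prompt, matching the self-generated error-and-correction pairs the model is instruction-tuned on.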