ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data
Gregory Polyakov, Ilseyar Alimova, Dmitry Abulkhanov, Ivan Sedykh, Andrey Bout, Sergey Nikolenko, Irina Piontkovskaya
Abstract
While open-source large language models (LLMs) have advanced in leveraging third-party tools, significant challenges remain in real-world API usage, where behavior is unpredictable or poorly specified. Existing benchmarks often fail to capture this complexity. We propose ToolReflection, a novel method that improves LLMs’ ability to self-correct API calls by utilizing real-time API feedback. We also introduce new datasets specifically designed to test model performance under realistic conditions. In ToolReflection, models undergo instruction tuning on a dataset augmented with self-generated errors and corrections. Our evaluation across the ToolAlpaca and ToolBench benchmarks and three newly developed datasets (GPT4Tools-OOD, GPT4Tools-OOD-Hard, and Multistep-100) demonstrates its effectiveness. ToolReflection boosts overall success rates by 25.4% on GPT4Tools-OOD, 56.2% on GPT4Tools-OOD-Hard, and 4% on Multistep-100, outperforming original models. On ToolAlpaca, we show a 14% improvement in the “Simulated” setting and 10.5% in the “Real-world” scenario. Our error analysis highlights that ToolReflection significantly enhances recovery from incorrect tool calls, even with incomplete or erroneous API documentation. We have released the code, prompts, and data at https://github.com/polgrisha/ToolReflection.
- Anthology ID:
- 2025.realm-1.14
- Volume:
- Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Ehsan Kamalloo, Nicolas Gontier, Xing Han Lu, Nouha Dziri, Shikhar Murty, Alexandre Lacoste
- Venues:
- REALM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 184–199
- URL:
- https://preview.aclanthology.org/display_plenaries/2025.realm-1.14/
- Cite (ACL):
- Gregory Polyakov, Ilseyar Alimova, Dmitry Abulkhanov, Ivan Sedykh, Andrey Bout, Sergey Nikolenko, and Irina Piontkovskaya. 2025. ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data. In Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 184–199, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data (Polyakov et al., REALM 2025)
- PDF:
- https://preview.aclanthology.org/display_plenaries/2025.realm-1.14.pdf