2025
ToolReflection: Improving Large Language Models for Real-World API Calls with Self-Generated Data
Gregory Polyakov | Ilseyar Alimova | Dmitry Abulkhanov | Ivan Sedykh | Andrey Bout | Sergey Nikolenko | Irina Piontkovskaya
Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
While open-source large language models (LLMs) have advanced in leveraging third-party tools, significant challenges remain in real-world API usage, where behavior is unpredictable or poorly specified. Existing benchmarks often fail to capture this complexity. We propose ToolReflection, a novel method that improves LLMs’ ability to self-correct API calls by utilizing real-time API feedback. We also introduce new datasets specifically designed to test model performance under realistic conditions. In ToolReflection, models undergo instruction tuning on a dataset augmented with self-generated errors and corrections. Our evaluation across the ToolAlpaca and ToolBench benchmarks and three newly developed datasets (GPT4Tools-OOD, GPT4Tools-OOD-Hard, and Multistep-100) demonstrates its effectiveness. ToolReflection boosts overall success rates by 25.4% on GPT4Tools-OOD, 56.2% on GPT4Tools-OOD-Hard, and 4% on Multistep-100, outperforming the original models. On ToolAlpaca, we show a 14% improvement in the “Simulated” setting and 10.5% in the “Real-world” scenario. Our error analysis highlights that ToolReflection significantly enhances recovery from incorrect tool calls, even with incomplete or erroneous API documentation. We have released the code, prompts, and data at https://github.com/polgrisha/ToolReflection.
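The self-correction loop described in the abstract can be illustrated with a minimal sketch, assuming hypothetical helpers `ask_model` and `call_api` that are not from the paper: the model proposes an API call, and if the real API returns an error, the raw feedback is appended to the prompt so the model can revise the call before retrying.

```python
# Minimal sketch of a self-correction loop over real API feedback.
# `ask_model` and `call_api` are hypothetical placeholders; the paper
# additionally instruction-tunes the model on self-generated
# error/correction traces, which is not shown here.

def solve_with_reflection(task: str, ask_model, call_api, max_retries: int = 3):
    prompt = f"Task: {task}\nProduce an API call."
    for _ in range(max_retries):
        api_call = ask_model(prompt)        # model proposes a tool call
        ok, response = call_api(api_call)   # execute against the real API
        if ok:
            return response                 # success: return the API result
        # failure: feed the raw API error back so the model can self-correct
        prompt += (
            f"\nPrevious call: {api_call}"
            f"\nAPI error: {response}"
            "\nReflect on the error and produce a corrected API call."
        )
    return None  # give up after max_retries unsuccessful attempts
```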
2024
Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option
Konstantin Yakovlev | Sergey Nikolenko | Andrey Bout
Findings of the Association for Computational Linguistics: EMNLP 2024
The recently proposed ToolkenGPT tool learning paradigm demonstrates promising performance but suffers from two major issues: first, it cannot benefit from tool documentation, and second, it often makes mistakes in whether to use a tool at all. We introduce Toolken+, which mitigates the first problem by reranking the top-k tools selected by ToolkenGPT, and the second problem with a special REJECT option such that the model generates a vocabulary token if REJECT is ranked first. We demonstrate the effectiveness of Toolken+ on multistep numerical reasoning and tool selection tasks.
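A minimal sketch of the decision rule described above, under the assumption of a hypothetical `rerank_score` function (for example, one that matches the query against tool documentation); it is an illustration of the rerank-plus-reject idea, not the authors' implementation.

```python
# Illustrative sketch: rerank the top-k tool candidates plus a special
# REJECT option; if REJECT wins, fall back to ordinary token generation.
# `rerank_score` is a hypothetical scorer over (query, documentation) pairs.

REJECT = "<reject>"

def select_tool(query: str, topk_tools: list[str], tool_docs: dict[str, str],
                rerank_score) -> str | None:
    candidates = topk_tools + [REJECT]
    scored = sorted(
        candidates,
        key=lambda t: rerank_score(query, tool_docs.get(t, "")),
        reverse=True,
    )
    best = scored[0]
    # If REJECT is ranked first, no tool is called and the model emits a
    # regular vocabulary token instead.
    return None if best == REJECT else best
```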
2023
GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding
Konstantin Yakovlev | Alexander Podolskiy | Andrey Bout | Sergey Nikolenko | Irina Piontkovskaya
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Grammatical error correction (GEC) is an important NLP task that is currently usually solved with autoregressive sequence-to-sequence models. However, approaches of this class are inherently slow due to one-by-one token generation, so non-autoregressive alternatives are needed. In this work, we propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network that outputs a self-attention weight matrix that can be used in beam search to find the best permutation of input tokens (with auxiliary <ins> tokens) and a decoder network based on a step-unrolled denoising autoencoder that fills in specific tokens. This allows us to find the token permutation after only one forward pass of the permutation network, avoiding autoregressive constructions. We show that the resulting network improves over previously known non-autoregressive methods for GEC and reaches the level of autoregressive methods that do not use language-specific synthetic data generation methods. Our results are supported by a comprehensive experimental validation on the CoNLL-2014 and BEA datasets and an extensive ablation study that supports our architectural and algorithmic choices.
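A toy sketch of the decoupled two-stage inference described above, with an entirely illustrative random score matrix and greedy search in place of beam search: one pass yields position-to-position scores, a search orders the input tokens, and a separate fill-in step would then replace the <ins> placeholders.

```python
import numpy as np

# Toy sketch of decoupled permutation-then-fill inference. The score matrix
# stands in for the permutation network's self-attention weights; greedy
# search is used instead of beam search for brevity.

def greedy_permutation(scores: np.ndarray) -> list[int]:
    """Follow the highest-scoring unused position at each step."""
    n = scores.shape[0]
    order, used, current = [], {0}, 0            # position 0 acts as <bos>
    for _ in range(n - 1):
        ranked = np.argsort(-scores[current])    # positions sorted by score
        nxt = next(int(j) for j in ranked if j not in used)
        order.append(nxt)
        used.add(nxt)
        current = nxt
    return order

tokens = ["<bos>", "cat", "the", "<ins>", "sat"]
rng = np.random.default_rng(0)
scores = rng.random((len(tokens), len(tokens)))  # stand-in for attention weights
permuted = [tokens[i] for i in greedy_permutation(scores)]
# A decoder network would then replace each <ins> with concrete tokens.
print(permuted)
```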
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule
Andrey Bout | Alexander Podolskiy | Sergey Nikolenko | Irina Piontkovskaya
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic data, pretraining on it, and then fine-tuning on real datasets; performance gains have been achieved either by ensembling or by using huge pretrained models such as XXL-T5 as the backbone. In this work, we explore an orthogonal direction: how to use available data more efficiently. First, we propose auxiliary tasks that exploit the alignment between the original and corrected sentences, such as predicting a sequence of corrections. We formulate each task as a sequence-to-sequence problem and perform multi-task training. Second, we discover that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance, so we set out to find the best training schedule. Together, these two ideas lead to significant improvements, producing results that improve the state of the art with much smaller models; in particular, we outperform the best models based on T5-XXL (11B parameters) with a BART-based model (400M parameters).
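A minimal sketch of one possible auxiliary target of the kind described above: instead of predicting the fully corrected sentence, the model predicts a sequence of edits derived from the source/target alignment. The edit format here is an illustrative assumption, not the paper's exact scheme.

```python
import difflib

# Derive an edit-sequence target from the alignment between the original and
# corrected sentences. Each auxiliary example remains a plain seq2seq pair,
# so the same encoder-decoder can be trained jointly on correction and
# edit-prediction tasks.

def correction_sequence(source: str, corrected: str) -> str:
    src, tgt = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            edits.append(f"REPLACE {' '.join(src[i1:i2])} -> {' '.join(tgt[j1:j2])}")
        elif op == "delete":
            edits.append(f"DELETE {' '.join(src[i1:i2])}")
        elif op == "insert":
            edits.append(f"INSERT {' '.join(tgt[j1:j2])}")
    return " ; ".join(edits) or "KEEP"

print(correction_sequence("He go to school yesterday",
                          "He went to school yesterday"))
# -> REPLACE go -> went
```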
2022
Template-based Approach to Zero-shot Intent Recognition
Dmitry Lamanov | Pavel Burnyshev | Ekaterina Artemova | Valentin Malykh | Andrey Bout | Irina Piontkovskaya
Proceedings of the 15th International Conference on Natural Language Generation
2021
Single Example Can Improve Zero-Shot Data Generation
Pavel Burnyshev | Valentin Malykh | Andrey Bout | Ekaterina Artemova | Irina Piontkovskaya
Proceedings of the 14th International Conference on Natural Language Generation
Sub-tasks of intent classification, such as robustness to distribution shift, adaptation to specific user groups and personalization, and out-of-domain detection, require extensive and flexible datasets for experiments and evaluation. As collecting such datasets is time- and labor-consuming, we propose to use text generation methods to gather them. The generator should be trained to generate utterances that belong to a given intent. We explore two approaches to the generation of task-oriented utterances: in the zero-shot approach, the model is trained to generate utterances from seen intents and is further used to generate utterances for intents unseen during training; in the one-shot approach, the model is presented with a single utterance from a test intent. We perform a thorough automatic and human evaluation of the intrinsic properties of the two generation approaches. The attributes of the generated data are close to those of the original test sets, which were collected via crowd-sourcing.
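A minimal sketch of how generator inputs could be conditioned in the two settings described above; the concrete templates are assumptions for illustration, not the paper's exact format.

```python
# Illustrative conditioning formats for the generator's input.

def zero_shot_input(intent: str) -> str:
    # zero-shot: the generator sees only the (unseen) intent label
    return f"intent: {intent} => utterance:"

def one_shot_input(intent: str, example_utterance: str) -> str:
    # one-shot: the generator also sees a single utterance from the test intent
    return f"intent: {intent} | example: {example_utterance} => utterance:"

print(zero_shot_input("book_flight"))
print(one_shot_input("book_flight", "I need a ticket to Berlin on Friday"))
```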
InFoBERT: Zero-Shot Approach to Natural Language Understanding Using Contextualized Word Embedding
Pavel Burnyshev | Andrey Bout | Valentin Malykh | Irina Piontkovskaya
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Natural language understanding is an important task in modern dialogue systems, and it becomes even more important as the functionality of these systems rapidly expands. In this work, we present an approach to zero-shot transfer learning for the tasks of intent classification and slot-filling based on pre-trained language models. We use deep contextualized models, feeding them utterances and natural language descriptions of user intents to obtain text embeddings. These embeddings are then used by a small neural network to produce predictions for intent and slot probabilities. This architecture achieves new state-of-the-art results in two zero-shot scenarios: single-language adaptation to new skills and cross-lingual adaptation.
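A minimal sketch of the scoring idea, assuming a stubbed-out encoder in place of the pre-trained language model: an utterance and each natural language intent description are embedded, and a small network scores every (utterance, description) pair to produce intent probabilities. Names and dimensions are illustrative only.

```python
import torch
import torch.nn as nn

# Sketch of scoring (utterance, intent description) pairs with a small head.
# The encoder is a deterministic random stub; the real model would use a
# pre-trained contextual language model instead.

EMB_DIM = 32

def encode(text: str) -> torch.Tensor:
    torch.manual_seed(sum(ord(c) for c in text))   # deterministic stub embedding
    return torch.randn(EMB_DIM)

scorer = nn.Sequential(                            # small head over the pair
    nn.Linear(2 * EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)

def intent_probs(utterance: str, intent_descriptions: dict[str, str]) -> dict[str, float]:
    u = encode(utterance)
    logits = torch.stack([
        scorer(torch.cat([u, encode(desc)])).squeeze(0)
        for desc in intent_descriptions.values()
    ])
    probs = torch.softmax(logits, dim=0)           # normalize over intents
    return dict(zip(intent_descriptions.keys(), probs.tolist()))

print(intent_probs("play some jazz music",
                   {"play_music": "the user wants to play music",
                    "set_alarm": "the user wants to set an alarm"}))
```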