Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Anh Ta; Junjie Zhu; Shahin Shayandeh

Abstract

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently *post-hoc*. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at *inference time*: a specialized reviewer agent evaluates provisional tool calls *prior to* execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation.

In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce *Helpfulness-Harmfulness metrics*: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value.

We evaluate our approach on BFCL (single-turn) and 𝜏²-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5–2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.
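The Helpfulness-Harmfulness metrics described in the abstract can be illustrated with a minimal Python sketch. It assumes each test case has a boolean correctness label for the base agent alone and for the agent with reviewer feedback; the function name `helpfulness_harmfulness`, the list-of-booleans input format, and the choice to return 0.0 when a denominator is empty are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence, Tuple


def helpfulness_harmfulness(
    base_correct: Sequence[bool],
    reviewed_correct: Sequence[bool],
) -> Tuple[float, float]:
    """Return (helpfulness %, harmfulness %) for paired per-case outcomes.

    Hypothetical helper: the input format is an assumption made for
    illustration and is not taken from the paper.
    """
    if len(base_correct) != len(reviewed_correct):
        raise ValueError("both sequences must cover the same test cases")

    pairs = list(zip(base_correct, reviewed_correct))

    # Cases the base agent got wrong / right without any feedback.
    n_base_wrong = sum(1 for base, _ in pairs if not base)
    n_base_right = len(pairs) - n_base_wrong

    # Base errors that feedback corrected.
    n_fixed = sum(1 for base, rev in pairs if not base and rev)
    # Correct base responses that feedback degraded.
    n_broken = sum(1 for base, rev in pairs if base and not rev)

    # Assumption: report 0.0 when there is nothing to correct or degrade.
    helpfulness = 100.0 * n_fixed / n_base_wrong if n_base_wrong else 0.0
    harmfulness = 100.0 * n_broken / n_base_right if n_base_right else 0.0
    return helpfulness, harmfulness


# Toy data: 4 correct and 2 incorrect base responses.
base = [True, True, True, True, False, False]
reviewed = [True, True, True, False, True, False]

helpfulness, harmfulness = helpfulness_harmfulness(base, reviewed)
print(f"helpfulness={helpfulness:.1f}%  harmfulness={harmfulness:.1f}%")
```

In this toy data the reviewer corrects one of the two base errors and degrades one of the four correct responses. The two percentages use different denominators, so the net effect on overall accuracy also depends on how often the base agent is right to begin with.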

Anthology ID: 2026.gem-main.13
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 136–147
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.13/
Cite (ACL): Anh Ta, Junjie Zhu, and Shahin Shayandeh. 2026. Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 136–147, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents (Ta et al., GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.13.pdf
