ToolRM: Towards Agentic Tool-Use Reward Modeling

Renhao Li; Jianhong Tu; Yang Su; Yantao Liu; Fei Huang; Hamid Alinejad-Rokny; Derek F. Wong (黄辉); Junyang Lin; Min Yang

Abstract

Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBench_BFCL, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool-calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond training objectives, generative ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release data to facilitate future research.

Anthology ID: 2026.findings-acl.419
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8613–8640
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.419/
Cite (ACL): Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, and Min Yang. 2026. ToolRM: Towards Agentic Tool-Use Reward Modeling. In Findings of the Association for Computational Linguistics: ACL 2026, pages 8613–8640, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ToolRM: Towards Agentic Tool-Use Reward Modeling (Li et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.419.pdf
Checklist: 2026.findings-acl.419.checklist.pdf