UniToolBench: A Benchmark for Tool-Augmented LLMs in Cross-Domain, Universal Task Automation

Xiaojie Guo; Yang Zhang; Bing Zhang; Ryo Kawahara; Mikio Takeuchi; Yada Zhu

Abstract

Recent advancements in Large Language Models (LLMs) have enabled autonomous agents to decompose complex tasks, select appropriate tools, and execute structured workflows. However, a key challenge in this field is the lack of a universal, large-scale, and cross-domain benchmark to systematically evaluate LLMs’ ability to reason over and utilize interconnected tools for automation. Existing benchmarks, such as TaskBench, focus on manually curated tool graphs for benchmark generation, which lack scalability and diversity across domains. To address this, we propose UniToolBench, a benchmark that incorporates automated tool graph construction by formulating link prediction as a probabilistic task, instead of relying on categorical LLM outputs. Furthermore, we introduce a confidence-based beam search sampling strategy to select high-confidence tool dependencies, ensuring more structured and semantically coherent subgraphs for evaluation. Through extensive experiments on multiple datasets, we demonstrate that while LLMs show promise in tool selection, significant challenges remain in parameter prediction and handling complex tool dependencies.
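The confidence-based beam search sampling mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tool names, the edge probabilities in `EDGE_PROB`, and the function `beam_search_chains` are all hypothetical. It only assumes that link prediction has already assigned each candidate tool dependency a probability, and it keeps the tool chains with the highest joint confidence at each expansion step.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical link-prediction output: the probability that the tool on the
# right can consume the output of the tool on the left. Values are made up.
EDGE_PROB: Dict[Tuple[str, str], float] = {
    ("ocr", "translate"): 0.9,
    ("ocr", "summarize"): 0.6,
    ("translate", "summarize"): 0.8,
    ("translate", "tts"): 0.7,
    ("summarize", "tts"): 0.5,
}


def beam_search_chains(
    start: str, depth: int, beam_width: int
) -> List[Tuple[List[str], float]]:
    """Return up to `beam_width` tool chains of at most `depth` edges,
    ranked by the sum of log edge probabilities (joint log-confidence)."""
    beams: List[Tuple[List[str], float]] = [([start], 0.0)]
    for _ in range(depth):
        candidates: List[Tuple[List[str], float]] = []
        for path, score in beams:
            for (src, dst), prob in EDGE_PROB.items():
                # Extend only along edges leaving the last tool; skip cycles.
                if src == path[-1] and dst not in path:
                    candidates.append((path + [dst], score + math.log(prob)))
        if not candidates:
            break  # No chain can be extended further.
        # Keep only the highest-confidence partial chains.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


chains = beam_search_chains("ocr", depth=2, beam_width=2)
for path, log_conf in chains:
    print(" -> ".join(path), round(math.exp(log_conf), 2))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long chains while preserving the ranking. The sketch samples linear chains only; sampling richer subgraph shapes would need a different expansion step.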

Anthology ID: 2026.findings-eacl.248
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4726–4736
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.248/
Cite (ACL): Xiaojie Guo, Yang Zhang, Bing Zhang, Ryo Kawahara, Mikio Takeuchi, and Yada Zhu. 2026. UniToolBench: A Benchmark for Tool-Augmented LLMs in Cross-Domain, Universal Task Automation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4726–4736, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): UniToolBench: A Benchmark for Tool-Augmented LLMs in Cross-Domain, Universal Task Automation (Guo et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.248.pdf
Checklist: 2026.findings-eacl.248.checklist.pdf
