ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem

Jia-Kai Dong; I-Wei Huang; Chun-Tin Wu; Yi-tien Tsai

Abstract

We introduce ETOM, a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross-server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through "equal function sets", enabling objective metrics such as F1 score and reducing reliance on LLM-as-a-judge evaluation. Its five-level curriculum systematically tests agent capabilities, from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. ETOM provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents.
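
To make the scoring idea in the abstract concrete, here is a minimal sketch of how an F1 score could be computed against ground truth expressed as sets of interchangeable tools. This is an illustration under stated assumptions, not the paper's implementation: the function name `f1_with_equal_function_sets`, the tool names, the greedy one-to-one matching, and the handling of the empty case are all hypothetical choices made for this example.

```python
from typing import Iterable


def f1_with_equal_function_sets(
    predicted: Iterable[str], gold_sets: list[set[str]]
) -> float:
    """F1 where each gold slot is a set of tools assumed to be interchangeable.

    Assumption: a predicted tool counts as correct if it belongs to any
    not-yet-matched gold set, and each gold set can be matched at most once.
    """
    predicted = list(dict.fromkeys(predicted))  # drop duplicates, keep order

    # Assumption: an empty prediction for an empty gold list (for example,
    # an out-of-scope request) is treated as a perfect answer.
    if not predicted and not gold_sets:
        return 1.0

    matched_slots: set[int] = set()
    true_positives = 0
    for tool in predicted:
        # Greedy matching: take the first unmatched gold set containing the tool.
        for index, equivalents in enumerate(gold_sets):
            if index not in matched_slots and tool in equivalents:
                matched_slots.add(index)
                true_positives += 1
                break

    if true_positives == 0:
        return 0.0

    precision = true_positives / len(predicted)
    recall = true_positives / len(gold_sets)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tool names, used only for illustration.
gold = [{"maps.search", "geo.lookup"}, {"weather.get"}]
score_partial = f1_with_equal_function_sets(["geo.lookup", "calendar.add"], gold)
score_full = f1_with_equal_function_sets(["maps.search", "weather.get"], gold)
```

In this sketch, picking either `maps.search` or `geo.lookup` satisfies the first gold slot, so an agent is not penalized for choosing one of several functionally overlapping tools. Greedy matching is a simplification; if gold sets overlap, a maximum bipartite matching would be needed to guarantee the best possible score.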

Anthology ID: 2026.findings-eacl.75
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1453–1488
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.75/
Cite (ACL): Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, and Yi-tien Tsai. 2026. ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1453–1488, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem (Dong et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.75.pdf
Checklist: 2026.findings-eacl.75.checklist.pdf
