MASEval: Extending Multi-Agent Evaluation from Models to Systems

Cornelius Emde; Alexander Rubinstein; Anmol Goel; Ahmed Heakl; Sangdoo Yun; Seong Joon Oh; Martin Gubri

MASEval: Extending Multi-Agent Evaluation from Models to Systems

Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri

Abstract

The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet many existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a Python library that treats the entire agentic system as the unit of analysis. Important design decisions such as harness and context engineering are first-class citizens. MASEval helps practitioners identify the best implementation for their use case and researchers systematically study agentic systems, opening new avenues for principled system design. Through the first systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that, across models of comparable cost and capability, framework choice matters as much as model choice. MASEval is available under the MIT licence at https://github.com/maseval/MASEval.

Anthology ID:: 2026.acl-demo.34
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Greg Durrett, Ping Jian
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 345–356
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-demo.34/
DOI:
Bibkey:
Cite (ACL):: Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, and Martin Gubri. 2026. MASEval: Extending Multi-Agent Evaluation from Models to Systems. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 345–356, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: MASEval: Extending Multi-Agent Evaluation from Models to Systems (Emde et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-demo.34.pdf

PDF Cite Search Fix data