MASEval: Extending Multi-Agent Evaluation from Models to Systems
Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Abstract
The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet many existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a Python library that treats the entire agentic system as the unit of analysis. Important design decisions such as harness and context engineering are first-class citizens. MASEval helps practitioners identify the best implementation for their use case and researchers systematically study agentic systems, opening new avenues for principled system design. Through the first systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that, across models of comparable cost and capability, framework choice matters as much as model choice. MASEval is available under the MIT licence at https://github.com/maseval/MASEval.
- Anthology ID: 2026.acl-demo.34
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Greg Durrett, Ping Jian
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 345–356
- URL: https://preview.aclanthology.org/ingest-acl/2026.acl-demo.34/
- Cite (ACL): Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, and Martin Gubri. 2026. MASEval: Extending Multi-Agent Evaluation from Models to Systems. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 345–356, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): MASEval: Extending Multi-Agent Evaluation from Models to Systems (Emde et al., ACL 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-demo.34.pdf