MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Zhiwei Liu | Jielin Qiu | Shiyu Wang | Jianguo Zhang | Zuxin Liu | Roshan Ram | Haolin Chen | Weiran Yao | Shelby Heinecke | Silvio Savarese | Huan Wang | Caiming Xiong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The rapid adoption of Large Language Models (LLMs) as intelligent agents has underscored the necessity for robust evaluation frameworks capable of assessing agent performance in realistic, interactive environments. Existing evaluation methodologies often suffer from limitations such as static task benchmarks, limited scope, and inadequate integration with practical applications. In response, we introduce MCPEval, an open-source, Model Context Protocol (MCP)-based evaluation framework specifically tailored for comprehensive and systematic assessment of LLM-powered agents. MCPEval standardizes evaluations across diverse domains through automated task generation and verification, supports multiple performance metrics, and integrates seamlessly with native agent capabilities. We empirically validate the effectiveness of MCPEval across five distinct real-world domains, highlighting significant variations in performance across various LLM architectures and prompting strategies. Our results illustrate the framework’s capacity to uncover nuanced performance patterns and identify domain-specific strengths and weaknesses, providing valuable insights beyond traditional binary success metrics. We publicly release MCPEval to foster reproducible research and promote standardized evaluation practices within the LLM agent community.
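The abstract describes a pipeline of automated task generation, agent rollouts over MCP tool calls, and verification that yields metrics beyond binary success. The sketch below illustrates that loop in Python under stated assumptions; all names here (Task, generate_tasks, run_agent, verify, evaluate) are hypothetical stand-ins and do not reflect the actual MCPEval API or the MCP SDK.

```python
# Hypothetical sketch of an MCP-style agent evaluation loop.
# Every identifier here is illustrative; this is NOT the MCPEval implementation.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    description: str                 # natural-language goal given to the agent
    expected_tool_calls: list[str]   # ground-truth tool sequence used for verification


@dataclass
class Trajectory:
    task: Task
    tool_calls: list[str] = field(default_factory=list)
    succeeded: bool = False


def generate_tasks(tool_names: list[str], n: int) -> list[Task]:
    """Stand-in for LLM-driven task generation over the tools an MCP server exposes."""
    return [Task(f"Use {name} to complete subtask {i}", [name])
            for i, name in enumerate(tool_names[:n])]


def run_agent(task: Task, call_tool: Callable[[str], str]) -> Trajectory:
    """Stand-in for an agent rollout: the harness executes the model's tool calls
    against the MCP server and records the trajectory."""
    traj = Trajectory(task=task)
    for name in task.expected_tool_calls:  # a real agent would plan these calls itself
        call_tool(name)
        traj.tool_calls.append(name)
    return traj


def verify(traj: Trajectory) -> bool:
    """Rule-based verification: did the agent issue the expected tool calls?"""
    return traj.tool_calls == traj.task.expected_tool_calls


def evaluate(tool_names: list[str],
             call_tool: Callable[[str], str],
             n_tasks: int = 5) -> dict:
    tasks = generate_tasks(tool_names, n_tasks)
    trajectories = [run_agent(t, call_tool) for t in tasks]
    for traj in trajectories:
        traj.succeeded = verify(traj)
    success = sum(t.succeeded for t in trajectories)
    # A full framework would also report finer-grained, domain-specific metrics
    # (e.g. per-tool precision), not just an overall success rate.
    return {"tasks": len(tasks), "success_rate": success / len(tasks)}


if __name__ == "__main__":
    # Toy "MCP server": a dict of tool name -> handler standing in for real tool calls.
    tools = {"search_flights": lambda: "ok", "book_hotel": lambda: "ok"}
    print(evaluate(list(tools), lambda name: tools[name]()))
```

The point of the sketch is the separation of concerns the abstract emphasizes: task generation and verification are automated around the same MCP tool interface the agent itself uses, so the evaluation plugs into native agent capabilities rather than a bespoke benchmark harness.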