MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Zhiwei Liu | Jielin Qiu | Shiyu Wang | Jianguo Zhang | Zuxin Liu | Roshan Ram | Haolin Chen | Weiran Yao | Shelby Heinecke | Silvio Savarese | Huan Wang | Caiming Xiong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The rapid adoption of Large Language Models (LLMs) as intelligent agents has underscored the necessity for robust evaluation frameworks capable of assessing agent performance in realistic, interactive environments. Existing evaluation methodologies often suffer from limitations such as static task benchmarks, limited scope, and inadequate integration with practical applications. In response, we introduce MCPEval, an open-source, Model Context Protocol (MCP)-based evaluation framework specifically tailored for comprehensive and systematic assessment of LLM-powered agents. MCPEval standardizes evaluations across diverse domains through automated task generation and verification, supports multiple performance metrics, and integrates seamlessly with native agent capabilities. We empirically validate the effectiveness of MCPEval across five distinct real-world domains, highlighting significant variations in performance across various LLM architectures and prompting strategies. Our results illustrate the framework’s capacity to uncover nuanced performance patterns and identify domain-specific strengths and weaknesses, providing valuable insights beyond traditional binary success metrics. We publicly release MCPEval to foster reproducible research and promote standardized evaluation practices within the LLM agent community.
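The abstract describes a pipeline of automated task generation, agent rollouts over MCP tool calls, and verification that yields metrics beyond binary success. The sketch below illustrates that loop in Python under stated assumptions; all names here (Task, generate_tasks, run_agent, verify, evaluate) are hypothetical stand-ins and do not reflect the actual MCPEval API or the MCP SDK.

```python
# Hypothetical sketch of an MCP-style agent evaluation loop.
# Every identifier here is illustrative; this is NOT the MCPEval implementation.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    description: str                 # natural-language goal given to the agent
    expected_tool_calls: list[str]   # ground-truth tool sequence used for verification


@dataclass
class Trajectory:
    task: Task
    tool_calls: list[str] = field(default_factory=list)
    succeeded: bool = False


def generate_tasks(tool_names: list[str], n: int) -> list[Task]:
    """Stand-in for LLM-driven task generation over the tools an MCP server exposes."""
    return [Task(f"Use {name} to complete subtask {i}", [name])
            for i, name in enumerate(tool_names[:n])]


def run_agent(task: Task, call_tool: Callable[[str], str]) -> Trajectory:
    """Stand-in for an agent rollout: the harness executes the model's tool calls
    against the MCP server and records the trajectory."""
    traj = Trajectory(task=task)
    for name in task.expected_tool_calls:  # a real agent would plan these calls itself
        call_tool(name)
        traj.tool_calls.append(name)
    return traj


def verify(traj: Trajectory) -> bool:
    """Rule-based verification: did the agent issue the expected tool calls?"""
    return traj.tool_calls == traj.task.expected_tool_calls


def evaluate(tool_names: list[str],
             call_tool: Callable[[str], str],
             n_tasks: int = 5) -> dict:
    tasks = generate_tasks(tool_names, n_tasks)
    trajectories = [run_agent(t, call_tool) for t in tasks]
    for traj in trajectories:
        traj.succeeded = verify(traj)
    success = sum(t.succeeded for t in trajectories)
    # A full framework would also report finer-grained, domain-specific metrics
    # (e.g. per-tool precision), not just an overall success rate.
    return {"tasks": len(tasks), "success_rate": success / len(tasks)}


if __name__ == "__main__":
    # Toy "MCP server": a dict of tool name -> handler standing in for real tool calls.
    tools = {"search_flights": lambda: "ok", "book_hotel": lambda: "ok"}
    print(evaluate(list(tools), lambda name: tools[name]()))
```

The point of the sketch is the separation of concerns the abstract emphasizes: task generation and verification are automated around the same MCP tool interface the agent itself uses, so the evaluation plugs into native agent capabilities rather than a bespoke benchmark harness.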