AutoPenBench: A Vulnerability Testing Benchmark for Generative Agents
Luca Gioacchini | Alexander Delsanto | Idilio Drago | Marco Mellia | Giuseppe Siracusano | Roberto Bifulco
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
LLM agents show promise for vulnerability testing. However, we lack benchmarks to evaluate and compare solutions. AutoPenBench addresses this need by offering an open benchmark for the evaluation of vulnerability testing agents. It includes 33 tasks, ranging from introductory exercises to actual vulnerable systems. It supports MCP, enabling the comparison of agent capabilities. We introduce per-task milestones, allowing comparison of the intermediate steps at which agents struggle. To illustrate the use of AutoPenBench, we evaluate autonomous and human-assisted agent architectures. The former achieves a 21% success rate, insufficient for production use, while human-assisted agents reach 64% success, indicating a viable industrial path. AutoPenBench is released as open source and enables fair comparison of agents.
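To make the milestone idea concrete, the following is a minimal sketch of how per-task milestone progress and overall success rates could be scored. All names here (Milestone, TaskRun, success_rate) are hypothetical illustrations under the assumptions stated in the comments, not AutoPenBench's actual API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Milestone:
    # A named intermediate step, e.g. "target discovered" or "exploit executed".
    name: str
    reached: bool = False

@dataclass
class TaskRun:
    # One agent run on one benchmark task, tracked as an ordered milestone list.
    task_id: str
    milestones: List[Milestone] = field(default_factory=list)

    @property
    def success(self) -> bool:
        # Assumption: a run counts as successful only if every milestone,
        # including the final one, is reached.
        return all(m.reached for m in self.milestones)

    @property
    def progress(self) -> float:
        # Fraction of milestones reached: shows *where* an agent got stuck,
        # even when the run as a whole fails.
        if not self.milestones:
            return 0.0
        return sum(m.reached for m in self.milestones) / len(self.milestones)

def success_rate(runs: List[TaskRun]) -> float:
    # Benchmark-level success rate across all task runs.
    return sum(r.success for r in runs) / len(runs) if runs else 0.0

# Example: a run that discovers the target and identifies the vulnerability
# but fails to exploit it scores 0.5 progress and counts as a failure.
run = TaskRun("example_task", [
    Milestone("target discovered", True),
    Milestone("vulnerability identified", True),
    Milestone("exploit executed", False),
    Milestone("flag captured", False),
])
print(run.progress, run.success)   # 0.5 False
print(success_rate([run]))         # 0.0
```

This kind of per-milestone bookkeeping is what allows the benchmark to compare where autonomous and human-assisted agents diverge, rather than reporting only end-to-end success.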