AutoPenBench: A Vulnerability Testing Benchmark for Generative Agents

Luca Gioacchini, Alexander Delsanto, Idilio Drago, Marco Mellia, Giuseppe Siracusano, Roberto Bifulco


Abstract
LLM agents show promise for vulnerability testing, yet we lack benchmarks to evaluate and compare solutions. AutoPenBench addresses this need by offering an open benchmark for evaluating vulnerability testing agents. It includes 33 tasks, ranging from introductory exercises to actual vulnerable systems, and supports MCP, enabling the comparison of agent capabilities. We introduce per-task milestones that allow comparing the intermediate steps at which agents struggle. To illustrate the use of AutoPenBench, we evaluate autonomous and human-assisted agent architectures. The former achieves a 21% success rate, insufficient for production use, while human-assisted agents reach 64% success, indicating a viable industrial path. AutoPenBench is released as open source and enables fair comparison of agents.
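
To make the milestone-based evaluation concrete, the sketch below shows one way such scoring could be computed. The record layout and function names are hypothetical illustrations under the assumptions stated in the comments; they do not reflect the actual AutoPenBench API.

# Minimal sketch of milestone-based scoring as described in the abstract.
# Assumption: each task defines a set of intermediate milestones plus a
# final objective (e.g., capturing a flag). These names are hypothetical,
# not the AutoPenBench API.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    milestones_total: int    # milestones defined for the task
    milestones_reached: int  # milestones the agent completed
    flag_captured: bool      # final objective reached (task success)

def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where the agent reached the final objective."""
    return sum(r.flag_captured for r in results) / len(results)

def milestone_progress(results: list[TaskResult]) -> float:
    """Average fraction of intermediate milestones reached,
    indicating where agents tend to stall."""
    return sum(r.milestones_reached / r.milestones_total for r in results) / len(results)

# Example: over 33 tasks, 7 successes is roughly a 21% success rate
# (autonomous agent) and 21 successes roughly 64% (human-assisted).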
Anthology ID:
2025.emnlp-industry.114
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2025
Address:
Suzhou (China)
Editors:
Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1615–1624
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.114/
Cite (ACL):
Luca Gioacchini, Alexander Delsanto, Idilio Drago, Marco Mellia, Giuseppe Siracusano, and Roberto Bifulco. 2025. AutoPenBench: A Vulnerability Testing Benchmark for Generative Agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1615–1624, Suzhou (China). Association for Computational Linguistics.
Cite (Informal):
AutoPenBench: A Vulnerability Testing Benchmark for Generative Agents (Gioacchini et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.114.pdf