From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing
Lanxiao Huang, Daksh Dave, Tyler Cody, Peter A. Beling, Ming Jin
Abstract
Large Language Models (LLMs) have been explored for automating or enhancing penetration testing tasks, but their effectiveness and reliability across diverse attack phases remain open questions. This study presents a comprehensive evaluation of multiple LLM-based agents, ranging from singular to modular designs, across realistic penetration testing scenarios, analyzing their empirical performance and recurring failure patterns. We further investigate the impact of core functional capabilities on agent success, operationalized through five targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions respectively support the capabilities of Context Coherence & Retention, Inter-Component Coordination & State Management, Tool Usage Accuracy & Selective Execution, Multi-Step Strategic Planning & Error Detection & Recovery, and Real-Time Dynamic Responsiveness. Our findings reveal that while some architectures natively exhibit select properties, targeted augmentations significantly enhance modular agent performance, particularly in complex, multi-step, and real-time penetration testing scenarios.
- Anthology ID:
- 2025.emnlp-main.802
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15890–15916
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.802/
- Cite (ACL):
- Lanxiao Huang, Daksh Dave, Tyler Cody, Peter A. Beling, and Ming Jin. 2025. From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15890–15916, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing (Huang et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.802.pdf