From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing

Lanxiao Huang, Daksh Dave, Tyler Cody, Peter A. Beling, Ming Jin


Abstract
Large Language Models (LLMs) have been explored for automating or enhancing penetration testing tasks, but their effectiveness and reliability across diverse attack phases remain open questions. This study presents a comprehensive evaluation of multiple LLM-based agents, ranging from singular to modular designs, across realistic penetration testing scenarios, analyzing their empirical performance and recurring failure patterns. We further investigate the impact of core functional capabilities on agent success, operationalized through five targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions respectively support the capabilities of Context Coherence & Retention, Inter-Component Coordination & State Management, Tool Usage Accuracy & Selective Execution, Multi-Step Strategic Planning & Error Detection & Recovery, and Real-Time Dynamic Responsiveness. Our findings reveal that while some architectures natively exhibit select properties, targeted augmentations significantly enhance modular agent performance, particularly in complex, multi-step, and real-time penetration testing scenarios.
Anthology ID:
2025.emnlp-main.802
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
15890–15916
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.802/
Cite (ACL):
Lanxiao Huang, Daksh Dave, Tyler Cody, Peter A. Beling, and Ming Jin. 2025. From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15890–15916, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing (Huang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.802.pdf
Checklist:
2025.emnlp-main.802.checklist.pdf