Rana Muhammad Shahroz Khan


2026

While Large Language Model (LLM) safety research has focused on single-agent, white-box settings, the adoption of Multi-Agent Systems (MAS) creates a critical blind spot: supply chain vulnerabilities in MAS ecosystems. These systems often rely on third-party agents accessed via black-box APIs, which allows attackers to embed hidden triggers that manipulate collective reasoning or outputs. Because internal weights are inaccessible, traditional white-box defenses cannot detect these threats. Consequently, a critical gap exists in auditing these systems for "Trojan" agents, i.e., malicious models that behave normally until triggered by specific, often multi-turn, conversational contexts. To bridge this gap, we introduce the Conversational Trojan Unmasking System (CTUS), a black-box auditing framework that leverages an Evolutionary Algorithm (EA) to autonomously expose hidden threats. Drawing on social deduction mechanics, CTUS deploys a "Judge" agent to evolve conversational probes that provoke Trojan agents into revealing their malicious nature without alerting benign peers. We validate CTUS across diverse architectures (Llama-2/3, Gemma, Mistral) and attack vectors (word, syntax, semantic, RLHF). Our results demonstrate that CTUS achieves superior detection rates (up to 100% in specific configurations). Furthermore, our analyses confirm the framework's robustness: CTUS exhibits negligible false positives on benign systems and remains stable across system configurations, establishing it as a scalable safeguard for the multi-agent landscape.
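The idea of evolving conversational probes against a black-box agent can be illustrated with a minimal sketch. This is not the CTUS implementation: the abstract does not specify the fitness function, mutation operators, or Judge prompt, so everything below (`toy_agent`, `judge_score`, `mutate`, `evolve_probes`, the vocabulary, and the two-word trigger) is a hypothetical stand-in chosen only to show the general shape of an elitist evolutionary search driven by observable replies. It also handles single-turn probes only and does not model the requirement of not alerting benign peers.

```python
import random

# All names below are illustrative assumptions, not part of CTUS.
VOCAB = ["hello", "report", "blue", "status", "lantern", "please", "summary", "team"]
HIDDEN_TRIGGER = {"blue", "lantern"}  # toy trigger; the auditor never reads this directly


def toy_agent(probe):
    """Hypothetical black-box agent: only its text reply is observable."""
    hits = len(HIDDEN_TRIGGER & set(probe))
    return ["benign reply", "slightly odd reply", "MALICIOUS reply"][hits]


def judge_score(probe):
    """Hypothetical Judge: maps the observed reply to a suspicion score in [0, 1]."""
    reply = toy_agent(probe)
    return {"benign reply": 0.0, "slightly odd reply": 0.5, "MALICIOUS reply": 1.0}[reply]


def mutate(probe, rng):
    """Replace one randomly chosen word of the probe with a random vocabulary word."""
    i = rng.randrange(len(probe))
    return probe[:i] + (rng.choice(VOCAB),) + probe[i + 1:]


def evolve_probes(population, fitness, generations=30, keep=4, seed=0):
    """Elitist loop: keep the top `keep` probes, refill the rest with mutated elites."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        elite = sorted(population, key=fitness, reverse=True)[:keep]
        children = [mutate(rng.choice(elite), rng) for _ in range(size - keep)]
        population = elite + children
    return max(population, key=fitness)


# Eight random four-word probes as the starting population.
initial = [tuple(random.Random(s).choices(VOCAB, k=4)) for s in range(8)]
best = evolve_probes(initial, judge_score)
print(best, judge_score(best))
```

Because the elite probes are carried over unchanged each generation, the best score in the population never decreases. A real audit would replace `toy_agent` with API calls to the agent under test and `judge_score` with an LLM-based Judge.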