The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies
Rana Muhammad Shahroz Khan, Ruichen Zhang, Zhen Tan, Charles Fleming, Tianlong Chen
Abstract
While Large Language Model (LLM) safety research has focused on single-agent, white-box settings, the adoption of Multi-Agent Systems (MAS) creates a critical blind spot: supply chain vulnerabilities in MAS ecosystems. These systems often rely on third-party agents accessed via black-box APIs, which lets attackers embed hidden triggers that manipulate collective reasoning or outputs. Because internal weights are inaccessible, traditional white-box defenses cannot detect these threats. Consequently, a critical gap exists in auditing these systems for "Trojan" agents, i.e., malicious models that behave normally until triggered by specific, often multi-turn, conversational contexts. To bridge this gap, we introduce the Conversational Trojan Unmasking System (CTUS), a black-box auditing framework that leverages an Evolutionary Algorithm (EA) to autonomously expose hidden threats. Drawing on social deduction mechanics, CTUS deploys a "Judge" agent to evolve conversational probes that provoke Trojan agents into revealing their malicious nature without alerting benign peers. We validate CTUS across diverse architectures (Llama-2/3, Gemma, Mistral) and attack vectors (word, syntax, semantic, RLHF). Our results demonstrate that CTUS achieves superior detection rates (up to 100% in specific configurations). Furthermore, rigorous analyses confirm the framework’s robustness: CTUS exhibits negligible false positives on benign systems and remains stable across system configurations, establishing it as a scalable safeguard for the multi-agent landscape.
- Anthology ID:
- 2026.findings-acl.1348
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27029–27044
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1348/
- Cite (ACL):
- Rana Muhammad Shahroz Khan, Ruichen Zhang, Zhen Tan, Charles Fleming, and Tianlong Chen. 2026. The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27029–27044, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- The Adaptive Interrogator: Detecting Trojan LLMs in Multi-Agent Systems via Evolved Conversational Strategies (Khan et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1348.pdf