Beneath the Facade: Probing Safety Vulnerabilities in LLMs via Auto-Generated Jailbreak Prompts

Heehyeon Kim, Kyeongryul Lee, Joyce Jiyoung Whang


Abstract
The rapid proliferation of large language models and multimodal generative models has raised concerns about their potential vulnerabilities to a wide range of real-world safety risks. However, a critical gap persists in systematic assessment, and existing evaluation frameworks struggle to keep pace with the breadth and variability of real-world risk factors. In this paper, we introduce TroGEN, an automated jailbreak prompt generation framework that assesses these vulnerabilities by deriving scenario-driven jailbreak prompts using an adversarial agent. Moving beyond labor-intensive dataset construction, TroGEN features an extensible design that covers a broad range of risks, supports plug-and-play jailbreak strategies, and adapts seamlessly to multimodal settings. Experimental results demonstrate that TroGEN effectively uncovers safety weaknesses, revealing susceptibilities to adversarial attacks that conceal malicious intent beneath an apparently benign facade, like a Trojan horse. Furthermore, such stealthy attacks remain effective even against existing jailbreak defense methods.
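To make the abstract's "plug-and-play" and "adversarial agent" claims concrete, the following is a minimal illustrative sketch of how such an extensible evaluation harness could be structured. It is not the TroGEN implementation: all names (JailbreakStrategy, AdversarialAgent, probe, wrap) and the overall loop are hypothetical assumptions for illustration, and no actual attack content is included.

```python
# Illustrative sketch only: class and method names are hypothetical and
# are NOT taken from the TroGEN paper or code release.
from typing import Protocol


class JailbreakStrategy(Protocol):
    """A pluggable strategy that wraps a risk scenario into a candidate prompt."""

    def wrap(self, scenario: str) -> str:
        ...


class AdversarialAgent:
    """Generates scenario-driven candidate prompts for a given risk category."""

    def __init__(self, generator, strategies: list[JailbreakStrategy]):
        self.generator = generator      # LLM callable used to draft benign-looking scenarios
        self.strategies = strategies    # plug-and-play attack strategies

    def probe(self, risk_category: str, target, judge) -> list[dict]:
        """Run every registered strategy against the target model and record verdicts."""
        scenario = self.generator(f"Draft a test scenario for risk category: {risk_category}")
        results = []
        for strategy in self.strategies:
            prompt = strategy.wrap(scenario)   # embed the probe in a strategy-specific frame
            response = target(prompt)          # query the model under evaluation
            verdict = judge(prompt, response)  # a safety judge labels the outcome
            results.append({"strategy": type(strategy).__name__, "verdict": verdict})
        return results
```

Under this reading, the extensibility described in the abstract would amount to implementing the same strategy interface for new attack styles or multimodal inputs, without changing the agent loop itself.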
Anthology ID:
2025.findings-emnlp.960
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17668–17700
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.960/
DOI:
10.18653/v1/2025.findings-emnlp.960
Cite (ACL):
Heehyeon Kim, Kyeongryul Lee, and Joyce Jiyoung Whang. 2025. Beneath the Facade: Probing Safety Vulnerabilities in LLMs via Auto-Generated Jailbreak Prompts. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17668–17700, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Beneath the Facade: Probing Safety Vulnerabilities in LLMs via Auto-Generated Jailbreak Prompts (Kim et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.960.pdf
Checklist:
2025.findings-emnlp.960.checklist.pdf