Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield

Jinhwa Kim, Ali Derakhshan, Ian Harris


Abstract
Large Language Models’ safety remains a critical concern due to their vulnerability to jailbreaking attacks, which can prompt these systems to produce harmful and malicious responses. Safety classifiers, computational models trained to discern and mitigate potentially harmful, offensive, or unethical outputs, offer a practical solution to address this issue. However, despite their potential, existing safety classifiers often fail when exposed to adversarial attacks such as gradient-optimized suffix attacks. In response, our study introduces Adversarial Prompt Shield (APS), a lightweight safety classifier model that excels in detection accuracy and demonstrates resilience against unseen jailbreaking prompts. We also introduce efficiently generated adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND), which are designed to fortify the classifier’s robustness. Through extensive testing on various safety tasks and unseen jailbreaking attacks, we demonstrate the effectiveness and resilience of our models. Evaluations show that our classifier has the potential to significantly reduce the Attack Success Rate by up to 44.9%. This advance paves the way for the next generation of more reliable and resilient Large Language Models.
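The abstract describes deploying the APS classifier as a safeguard that screens inputs for jailbreaking content before they reach the model. As a rough illustration only (not the authors' released code), a prompt-screening step with a generic Hugging Face text-classification model might look like the sketch below; the checkpoint path, the "unsafe" label string, and the threshold are placeholder assumptions.

```python
# Hypothetical sketch: using a lightweight safety classifier (such as APS)
# as a pre-filter in front of an LLM. The checkpoint name and label strings
# are placeholders, not the authors' released artifacts.
from transformers import pipeline

safety_clf = pipeline("text-classification", model="path/to/safety-classifier")

def prompt_is_safe(prompt: str, threshold: float = 0.5) -> bool:
    """Return True if the prompt is judged safe to forward to the LLM."""
    result = safety_clf(prompt)[0]  # e.g. {"label": "unsafe", "score": 0.97}
    flagged = result["label"].lower() == "unsafe" and result["score"] >= threshold
    return not flagged

user_prompt = "Describe your safety guidelines."
if prompt_is_safe(user_prompt):
    pass  # forward the prompt to the LLM
else:
    pass  # refuse, or return a canned safe response
```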
Anthology ID: 2024.woah-1.12
Volume: Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Yi-Ling Chung, Zeerak Talat, Debora Nozza, Flor Miriam Plaza-del-Arco, Paul Röttger, Aida Mostafazadeh Davani, Agostina Calabrese
Venues: WOAH | WS
Publisher: Association for Computational Linguistics
Pages: 159–170
URL: https://aclanthology.org/2024.woah-1.12
Cite (ACL): Jinhwa Kim, Ali Derakhshan, and Ian Harris. 2024. Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield. In Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024), pages 159–170, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield (Kim et al., WOAH-WS 2024)
PDF: https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.woah-1.12.pdf