@inproceedings{bassani-sanchez-2025-guardrail,
    title = "On Guardrail Models' Robustness to Mutations and Adversarial Attacks",
    author = "Bassani, Elias and
      Sanchez, Ignacio",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.922/",
    doi = "10.18653/v1/2025.findings-emnlp.922",
    pages = "16995--17006",
    isbn = "979-8-89176-335-7",
    abstract = "The risk of generative AI systems providing unsafe information has raised significant concerns, emphasizing the need for safety guardrails. To mitigate this risk, guardrail models are increasingly used to detect unsafe content in human-AI interactions, complementing the safety alignment of Large Language Models. Despite recent efforts to evaluate those models' effectiveness, their robustness to input mutations and adversarial attacks remains largely unexplored. In this paper, we present a comprehensive evaluation of 15 state-of-the-art guardrail models, assessing their robustness to: a) input mutations, such as typos, keywords camouflage, ciphers, and veiled expressions, and b) adversarial attacks designed to bypass models' safety alignment. Those attacks exploit LLMs capabilities like instruction-following, role-playing, personification, reasoning, and coding, or introduce adversarial tokens to induce model misbehavior. Our results reveal that most guardrail models can be evaded with simple input mutations and are vulnerable to adversarial attacks. For instance, a single adversarial token can deceive them 44.5{\%} of the time on average. The limitations of the current generation of guardrail models highlight the need for more robust safety guardrails."
}
Markdown (Informal)
[On Guardrail Models’ Robustness to Mutations and Adversarial Attacks](https://aclanthology.org/2025.findings-emnlp.922/) (Bassani & Sanchez, Findings 2025)
ACL