Ignacio Sanchez
2025
On Guardrail Models’ Robustness to Mutations and Adversarial Attacks
Elias Bassani | Ignacio Sanchez
Findings of the Association for Computational Linguistics: EMNLP 2025
The risk of generative AI systems providing unsafe information has raised significant concerns, emphasizing the need for safety guardrails. To mitigate this risk, guardrail models are increasingly used to detect unsafe content in human-AI interactions, complementing the safety alignment of Large Language Models. Despite recent efforts to evaluate these models' effectiveness, their robustness to input mutations and adversarial attacks remains largely unexplored. In this paper, we present a comprehensive evaluation of 15 state-of-the-art guardrail models, assessing their robustness to: (a) input mutations, such as typos, keyword camouflage, ciphers, and veiled expressions, and (b) adversarial attacks designed to bypass models' safety alignment. These attacks exploit LLM capabilities such as instruction-following, role-playing, personification, reasoning, and coding, or introduce adversarial tokens to induce model misbehavior. Our results reveal that most guardrail models can be evaded with simple input mutations and are vulnerable to adversarial attacks. For instance, a single adversarial token can deceive them 44.5% of the time on average. The limitations of the current generation of guardrail models highlight the need for more robust safety guardrails.
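To make the notion of input mutations concrete, the sketch below shows how a prompt might be perturbed with typos or keyword camouflage before being scored by a guardrail. It is a minimal illustration only: the mutation functions and the `is_unsafe` callable are hypothetical placeholders and do not correspond to the paper's actual implementation or evaluation code.

```python
# Illustrative sketch of prompt mutations of the kind evaluated in the paper.
# The guardrail interface (`is_unsafe`) is a hypothetical placeholder.
import random


def add_typos(text: str, rate: float = 0.1) -> str:
    """Randomly swap adjacent letters to simulate typos."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def camouflage_keywords(text: str, substitutions: dict) -> str:
    """Replace sensitive keywords with look-alike spellings (e.g., leetspeak)."""
    for word, disguised in substitutions.items():
        text = text.replace(word, disguised)
    return text


def evasion_rate(unsafe_prompts, mutate, is_unsafe) -> float:
    """Fraction of unsafe prompts the guardrail no longer flags after mutation."""
    evaded = sum(1 for p in unsafe_prompts if not is_unsafe(mutate(p)))
    return evaded / len(unsafe_prompts)
```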
2024
GuardBench: A Large-Scale Benchmark for Guardrail Models
Elias Bassani | Ignacio Sanchez
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Generative AI systems powered by Large Language Models have become increasingly popular in recent years. However, the risk of providing users with unsafe information has raised significant concerns about the adoption of these systems in safety-critical domains. In response, input-output filters, commonly called guardrail models, have been proposed to complement other measures, such as model alignment. Unfortunately, the lack of a standard benchmark for guardrail models poses significant evaluation issues and makes it difficult to compare results across scientific publications. To fill this gap, we introduce GuardBench, a large-scale benchmark for guardrail models comprising 40 safety evaluation datasets. To facilitate the adoption of GuardBench, we release a Python library providing an automated evaluation pipeline built on top of it. With our benchmark, we also share the first large-scale prompt moderation datasets in German, French, Italian, and Spanish. To assess the current state of the art, we conduct an extensive comparison of recent guardrail models and show that a general-purpose instruction-following model of comparable size achieves competitive results without the need for specific fine-tuning.
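As a rough illustration of what such an evaluation pipeline computes, the sketch below scores a user-supplied moderation function against a labeled prompt dataset. The dataset format and the `moderate` callable are assumptions made for illustration; this is not the GuardBench library's actual API.

```python
# Minimal sketch of a guardrail evaluation loop, assuming a dataset of
# (prompt, is_unsafe_label) pairs and a user-supplied `moderate` callable.
# This does NOT reflect the actual GuardBench API.
from typing import Callable, Iterable, Tuple


def evaluate(
    dataset: Iterable[Tuple[str, bool]],
    moderate: Callable[[str], bool],
) -> dict:
    """Compute precision, recall, and F1 for a binary prompt-moderation model."""
    tp = fp = fn = tn = 0
    for prompt, label in dataset:
        pred = moderate(prompt)  # True if the guardrail flags the prompt as unsafe
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```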