Julia Stoyanovich

2025

pdf bib abs
SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
Joao Fonseca | Andrew Bell | Julia Stoyanovich
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to illicit high risk behavior from a model, highlighting the critical need to safeguard widely-deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs “self-reflect,” may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict “normal” model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we make three contributions: (1) We introduce SAFENUDGE, a novel safeguard that combines Controlled Text Generation and “nudging.” SAFENUDGE triggers during text-generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by between 28.1% and 37.3% by guiding the LLM towards a safe response. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Second, it supports tunable SPTs, meaning practitioners can set their own tolerance for trade-offs balancing safety and restrictions to normal model behavior. Third, we release the source code for SAFENUDGE at https://github.com/joaopfonseca/SafeNudge. It is open source and compatible with the HuggingFace transformers library.

Co-authors

Andrew Bell 1
Joao Fonseca 1

Venues

emnlp1

Fix author