SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

Joao Fonseca; Andrew Bell; Julia Stoyanovich

Abstract

Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, that is, adversarial attacks used to elicit high-risk behavior from a model, highlighting the critical need to safeguard widely deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs “self-reflect,” may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict “normal” model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we make three contributions. First, we introduce SAFENUDGE, a novel safeguard that combines Controlled Text Generation and “nudging.” SAFENUDGE triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by between 28.1% and 37.3% by guiding the LLM towards a safe response. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Second, SAFENUDGE supports tunable SPTs, meaning practitioners can set their own tolerance for the trade-off between safety and restrictions on normal model behavior. Third, we release the source code for SAFENUDGE at https://github.com/joaopfonseca/SafeNudge. It is open source and compatible with the HuggingFace transformers library.

Anthology ID: 2025.emnlp-main.1010
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 19966–19980
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1010/
Cite (ACL): Joao Fonseca, Andrew Bell, and Julia Stoyanovich. 2025. SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19966–19980, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs (Fonseca et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1010.pdf
Checklist: 2025.emnlp-main.1010.checklist.pdf
