DIESEL: A Lightweight Inference-Time Safety Enhancement for Language Models
Ben Ganon | Alon Zolfi | Omer Hofman | Inderjeet Singh | Hisashi Kojima | Yuval Elovici | Asaf Shabtai
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have demonstrated impressive performance across a wide range of tasks, including open-ended dialogue, driving advancements in virtual assistants and other interactive systems. However, these models often generate outputs misaligned with human values, such as ethical norms and safety constraints, resulting in potentially harmful or inappropriate responses. While several techniques have been proposed to address this problem, they typically involve computationally intensive training procedures or introduce substantial inference-time latency. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesirable content during generation. DIESEL guides generation by reranking token candidates according to their semantic similarity to predefined negative concepts in the latent space. It can serve either as a standalone safeguard or as an auxiliary defense layer, enhancing response safety without requiring model fine-tuning or additional data. We demonstrate DIESEL’s effectiveness on state-of-the-art conversational models, including in adversarial jailbreak scenarios. Furthermore, we show that DIESEL generalizes beyond safety applications, enabling flexible and domain-specific response filtering.
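To make the abstract's mechanism concrete, below is a minimal sketch of inference-time candidate reranking in the spirit of DIESEL. It is not the authors' implementation: the model names, the use of a sentence-transformer encoder as the latent space, the penalty weight `alpha`, and the `rerank_step` helper are all illustrative assumptions. At each decoding step, the top-k candidate tokens are rescored by subtracting their maximum cosine similarity to embeddings of predefined negative concepts.

```python
# Hypothetical sketch of DIESEL-style reranking -- NOT the paper's code.
# Assumptions: a HuggingFace causal LM, a sentence-transformers encoder as
# the semantic latent space, and a hand-picked penalty weight `alpha`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

lm_name = "meta-llama/Llama-2-7b-chat-hf"  # any autoregressive LLM
tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # proxy latent space

# Predefined negative concepts to steer away from (illustrative choices).
negative_concepts = ["violence", "self-harm", "illegal activity"]
neg_emb = encoder.encode(negative_concepts, convert_to_tensor=True)

def rerank_step(prompt: str, top_k: int = 10, alpha: float = 5.0) -> int:
    """Pick the next token after penalizing candidates that are
    semantically close to any negative concept."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]
    top = torch.topk(logits, top_k)
    # Embed each candidate continuation and score it against the concepts.
    texts = [prompt + tokenizer.decode(int(t)) for t in top.indices]
    cand_emb = encoder.encode(texts, convert_to_tensor=True)
    sim = util.cos_sim(cand_emb, neg_emb).max(dim=1).values  # worst concept
    scores = top.values - alpha * sim  # original score minus safety penalty
    return int(top.indices[scores.argmax()])
```

One design point this sketch highlights: the penalty is applied only to the top-k candidates already proposed by the LM, so the overhead is k encoder calls per step rather than a full-vocabulary pass, which is what keeps the approach lightweight relative to retraining or guard-model pipelines.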