DIESEL: A Lightweight Inference-Time Safety Enhancement for Language Models
Ben Ganon | Alon Zolfi | Omer Hofman | Inderjeet Singh | Hisashi Kojima | Yuval Elovici | Asaf Shabtai
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have demonstrated impressive performance across a wide range of tasks, including open-ended dialogue, driving advancements in virtual assistants and other interactive systems. However, these models often generate outputs misaligned with human values, such as ethical norms and safety constraints, resulting in potentially harmful or inappropriate responses. While several techniques have been proposed to address this problem, they typically involve computationally intensive training procedures or introduce substantial inference-time latency. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesirable content during generation. DIESEL guides generation by reranking token candidates according to their semantic similarity to predefined negative concepts in the latent space. It can serve either as a standalone safeguard or as an auxiliary defense layer, enhancing response safety without requiring model fine-tuning or additional data. We demonstrate DIESEL’s effectiveness on state-of-the-art conversational models, including in adversarial jailbreak scenarios. Furthermore, we show that DIESEL generalizes beyond safety applications, enabling flexible and domain-specific response filtering.
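To make the abstract's mechanism concrete, below is a minimal sketch of inference-time candidate reranking in the spirit of DIESEL. It is not the authors' implementation: the model names, the use of a sentence-transformer encoder as the latent space, the penalty weight `alpha`, and the `rerank_step` helper are all illustrative assumptions. At each decoding step, the top-k candidate tokens are rescored by subtracting their maximum cosine similarity to embeddings of predefined negative concepts.

```python
# Hypothetical sketch of DIESEL-style reranking -- NOT the paper's code.
# Assumptions: a HuggingFace causal LM, a sentence-transformers encoder as
# the semantic latent space, and a hand-picked penalty weight `alpha`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

lm_name = "meta-llama/Llama-2-7b-chat-hf"  # any autoregressive LLM
tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # proxy latent space

# Predefined negative concepts to steer away from (illustrative choices).
negative_concepts = ["violence", "self-harm", "illegal activity"]
neg_emb = encoder.encode(negative_concepts, convert_to_tensor=True)

def rerank_step(prompt: str, top_k: int = 10, alpha: float = 5.0) -> int:
    """Pick the next token after penalizing candidates that are
    semantically close to any negative concept."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]
    top = torch.topk(logits, top_k)
    # Embed each candidate continuation and score it against the concepts.
    texts = [prompt + tokenizer.decode(int(t)) for t in top.indices]
    cand_emb = encoder.encode(texts, convert_to_tensor=True)
    sim = util.cos_sim(cand_emb, neg_emb).max(dim=1).values  # worst concept
    scores = top.values - alpha * sim  # original score minus safety penalty
    return int(top.indices[scores.argmax()])
```

One design point this sketch highlights: the penalty is applied only to the top-k candidates already proposed by the LM, so the overhead is k encoder calls per step rather than a full-vocabulary pass, which is what keeps the approach lightweight relative to retraining or guard-model pipelines.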