Mahmood Sharif


2026

The potential misuse and misalignment of language models (LMs) is a central safety concern. This work presents Self-Destruct, a novel mechanism to restrict specific behaviors in LMs by leveraging overlooked properties of the underlying hardware. We observe that LM frameworks use limited-precision formats (e.g., BF16), which are vulnerable to overflow errors during matrix multiplications. Exploiting this property, Self-Destruct replaces selected weights in pre-trained LM layers with values that act as traps, triggering a system error only when the model engages in targeted behaviors, such as harmful text generation, while leaving normal functionality unaffected. Unlike post hoc filters, this safeguard is embedded directly within the model, introduces neither inference overhead nor auxiliary models, and requires only a set of examples for calibration. Extensive experiments with five LM families demonstrate that Self-Destruct provides competitive protection against jailbreak attacks while preserving accuracy on standard benchmarks. In addition, we show that Self-Destruct is versatile, helping mitigate biased text generation and enabling model fingerprinting, highlighting the potential of hardware-aware safeguards as an efficient, low-overhead complement to existing LM defenses.
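The overflow behavior the abstract relies on can be illustrated with a minimal sketch. This is not the paper's implementation: it uses NumPy's float16 as a stand-in for BF16 (BF16 has a much wider exponent range, but the same failure mode), and the "trap" weights and activation magnitudes are arbitrary values chosen for illustration.

```python
import numpy as np

# Limited-precision matrix multiplication can overflow to inf.
# float16 stands in for BF16 here; its maximum finite value is 65504.
W = np.full((2, 2), 200.0, dtype=np.float16)  # hypothetical "trap" weights
x = np.full((2, 1), 300.0, dtype=np.float16)  # activation exceeding the trap threshold

# Each output element accumulates 200*300 + 200*300 = 120000 > 65504,
# so the product overflows to inf -- a detectable, hardware-level signal.
y = W @ x
print(np.isinf(y).all())  # True
```

In the paper's setting, such an overflow would surface as a system error only when activations reach trap-triggering magnitudes, leaving in-range (benign) computations untouched.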