Safeguarding Language Models via Self-Destruct Trapdoor

Shahar Katz; Bar Alon; Ariel Shaulov; Lior Wolf; Mahmood Sharif

Abstract

The potential misuse and misalignment of language models (LMs) are a central safety concern. This work presents Self-Destruct, a novel mechanism to restrict specific behaviors in LMs by leveraging overlooked properties of the underlying hardware. We observe that LM frameworks use limited-precision formats (e.g., BF16), which are vulnerable to overflow errors during matrix multiplications. Exploiting this property, Self-Destruct replaces selected weights in pre-trained LM layers with values that act as traps, triggering a system error only when the model engages in targeted behaviors, such as harmful text generation, while leaving normal functionality unaffected. Unlike post hoc filters, this safeguard is embedded directly within the model, introduces neither inference overhead nor auxiliary models, and requires only a set of examples for calibration. Extensive experiments with five LM families demonstrate that Self-Destruct provides competitive protection against jailbreak attacks while preserving accuracy on standard benchmarks. We also show that Self-Destruct is versatile, helping mitigate biased text generation and enabling model fingerprinting, which highlights the potential of hardware-aware safeguards as an efficient, low-overhead complement to existing LM defenses.
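The overflow property the abstract relies on can be illustrated with a minimal sketch. This is not the authors' implementation: NumPy has no native BF16 type, so float16 (largest finite value 65504) stands in for the limited-precision format, and the weight matrix, the "trap" value of 60000, the input vectors, and the `forward` helper are all hypothetical choices made for illustration. The idea shown is only that one very large weight stays harmless while the matching input feature is near zero, and overflows to infinity once that feature becomes active.

```python
import numpy as np

def forward(weights, x):
    """Limited-precision matrix-vector product that fails on overflow.

    float16 is an assumed stand-in for BF16; the error raised here is a
    hypothetical analogue of the system error described in the abstract.
    """
    with np.errstate(over="ignore", invalid="ignore"):
        out = weights.astype(np.float16) @ x.astype(np.float16)
    if not np.all(np.isfinite(out)):
        raise FloatingPointError("non-finite activation: trap triggered")
    return out

# Hypothetical layer: the last weight in the second row acts as the trap.
W = np.array([[0.5, -0.25, 0.0],
              [0.1, 0.2, 60000.0]], dtype=np.float16)

# Hypothetical inputs: the third feature is inactive for benign input
# and active for the targeted behavior.
benign = np.array([1.0, 2.0, 0.0])
targeted = np.array([1.0, 2.0, 8.0])

print(forward(W, benign))      # finite output, normal behavior preserved

try:
    forward(W, targeted)       # 60000 * 8 exceeds the float16 range
except FloatingPointError as err:
    print("error:", err)
```

In this toy setting the benign input never touches the trap weight, so its output is unchanged, whereas the targeted input drives the product past the representable range. How the paper selects weights and calibrates trap values from examples is not covered by this sketch.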

Anthology ID: 2026.eacl-long.326
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 6939–6958
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.326/
Cite (ACL): Shahar Katz, Bar Alon, Ariel Shaulov, Lior Wolf, and Mahmood Sharif. 2026. Safeguarding Language Models via Self-Destruct Trapdoor. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6939–6958, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Safeguarding Language Models via Self-Destruct Trapdoor (Katz et al., EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.326.pdf
