Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space
Andrea Ermellino, Lorenzo Malandri, Fabio Mercorio, Antonio Serino
Abstract
Ensuring the safety of Large Language Models (LLMs) is a critical alignment challenge. Existing approaches often rely on invasive fine-tuning or external generation-based checks, which can be opaque and resource-inefficient. In this work, we investigate the geometry of safety concepts within pretrained representations, proposing a mechanistic methodology that identifies the layer where safe and unsafe concepts are maximally separable within a pretrained model’s representation space. By leveraging the intrinsic activation space of the optimal layer, we show that safety enforcement can be achieved via a simple linear classifier, avoiding the need for weight modification. We validate our framework across multiple domains (regulation, law, finance, cybersecurity, education, code, human resources, and social media), diverse tasks (safety classification, prompt injection, and toxicity detection), and 16 non-English languages on both encoder and decoder architectures. Our results show that: (i) the separation between safe and unsafe concepts emerges from a single layer direction in the activation space, and (ii) monitoring internal representations provides a significantly more robust safeguarding mechanism than traditional evaluative or generative guardrail paradigms.
- Anthology ID:
- 2026.eacl-long.139
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3019–3034
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.139/
- Cite (ACL):
- Andrea Ermellino, Lorenzo Malandri, Fabio Mercorio, and Antonio Serino. 2026. Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3019–3034, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space (Ermellino et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.139.pdf
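The layer-wise probing idea from the abstract can be illustrated with a minimal sketch: for each layer, fit a linear probe on that layer's activations and keep the layer where safe/unsafe examples are most separable. The sketch below uses synthetic activations in place of real hidden states; the separability profile, dimensions, and all variable names are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of layer-wise linear probing for safe/unsafe separation.
# Real usage would extract per-layer hidden states from an LLM; here synthetic
# activations stand in, with class separation along a single direction that
# peaks at a middle layer (mimicking the single-direction finding).
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, dim = 6, 200, 32

labels = rng.integers(0, 2, n_samples)        # 0 = safe, 1 = unsafe (synthetic)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)        # the assumed "safety direction"
sep_by_layer = [0.1, 0.5, 1.5, 3.0, 2.0, 1.0] # assumed separability profile

# Build one activation matrix per "layer": noise plus a label-dependent
# shift of varying strength along the shared direction.
acts = []
for sep in sep_by_layer:
    layer = rng.normal(size=(n_samples, dim))
    layer += np.outer(labels * 2 - 1, direction) * sep
    acts.append(layer)

def probe_accuracy(X, y):
    """Fit a least-squares linear probe and return its training accuracy."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y * 2 - 1, rcond=None)
    return float(np.mean((Xb @ w > 0) == y))

accs = [probe_accuracy(X, labels) for X in acts]
best_layer = int(np.argmax(accs))
print(f"best layer: {best_layer}, probe accuracy: {accs[best_layer]:.2f}")
```

With real models, the same loop would run over cached hidden states per layer, and the probe trained at the best layer then serves as the lightweight safeguard, with no weight modification.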