Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space
Andrea Ermellino, Lorenzo Malandri, Fabio Mercorio, Antonio Serino
Abstract
Ensuring the safety of Large Language Models (LLMs) is a critical alignment challenge. Existing approaches often rely on invasive fine-tuning or external generation-based checks, which can be opaque and resource-inefficient. In this work, we investigate the geometry of safety concepts within pretrained representations, proposing a mechanistic methodology that identifies the layer where safe and unsafe concepts are maximally separable within a pretrained model’s representation space. By leveraging the intrinsic activation space of the optimal layer, we show that safety enforcement can be achieved via a simple linear classifier, avoiding the need for weight modification. We validate our framework across multiple domains (regulation, law, finance, cybersecurity, education, code, human resources, and social media), diverse tasks (safety classification, prompt injection, and toxicity detection), and 16 non-English languages on both encoder and decoder architectures. Our results show that: (i) the separation between safe and unsafe concepts emerges from a single layer direction in the activation space, and (ii) monitoring internal representations provides a significantly more robust safeguarding mechanism than traditional evaluative or generative guardrail paradigms.
- Anthology ID:
- 2026.eacl-long.139
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3019–3034
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.139/
- Cite (ACL):
- Andrea Ermellino, Lorenzo Malandri, Fabio Mercorio, and Antonio Serino. 2026. Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3019–3034, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space (Ermellino et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.139.pdf
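The layer-wise probing idea from the abstract can be illustrated with a minimal sketch: for each layer, fit a linear probe on that layer's activations and keep the layer where safe/unsafe examples are most separable. The sketch below uses synthetic activations in place of real hidden states; the separability profile, dimensions, and all variable names are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of layer-wise linear probing for safe/unsafe separation.
# Real usage would extract per-layer hidden states from an LLM; here synthetic
# activations stand in, with class separation along a single direction that
# peaks at a middle layer (mimicking the single-direction finding).
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, dim = 6, 200, 32

labels = rng.integers(0, 2, n_samples)        # 0 = safe, 1 = unsafe (synthetic)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)        # the assumed "safety direction"
sep_by_layer = [0.1, 0.5, 1.5, 3.0, 2.0, 1.0] # assumed separability profile

# Build one activation matrix per "layer": noise plus a label-dependent
# shift of varying strength along the shared direction.
acts = []
for sep in sep_by_layer:
    layer = rng.normal(size=(n_samples, dim))
    layer += np.outer(labels * 2 - 1, direction) * sep
    acts.append(layer)

def probe_accuracy(X, y):
    """Fit a least-squares linear probe and return its training accuracy."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y * 2 - 1, rcond=None)
    return float(np.mean((Xb @ w > 0) == y))

accs = [probe_accuracy(X, labels) for X in acts]
best_layer = int(np.argmax(accs))
print(f"best layer: {best_layer}, probe accuracy: {accs[best_layer]:.2f}")
```

With real models, the same loop would run over cached hidden states per layer, and the probe trained at the best layer then serves as the lightweight safeguard, with no weight modification.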