Understanding Emergent Misalignment via Feature Superposition Geometry

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo


Abstract
Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this phenomenon, we propose a mechanistic account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in accordance with their similarity. We give a simple gradient-level derivation of this mechanism and empirically test it across multiple LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, gpt-oss 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that they are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice). Finally, we show that a geometry-aware approach, which filters out the training samples nearest to toxic features, reduces misalignment by 34.5%, substantially outperforming random removal and achieving stronger mitigation than LLM-as-a-judge–based filtering. Our study explains emergent misalignment through feature superposition, providing a basis for understanding and mitigating this phenomenon.
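The similarity-proportional spillover described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's derivation: it assumes each feature is a unit direction w_i, that a feature's activation is the dot product a_i = w_i · x with a hidden state x, and that "fine-tuning" is a single gradient-ascent step on the target activation taken with respect to x. The feature names, vectors, and learning rate are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Rescale v to unit length."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def activation_shifts(features, target, lr=0.1):
    """Change in every feature's activation after one ascent step on the target.

    Toy assumption: a_i = w_i . x, so d(a_target)/dx = w_target.
    The step moves x by lr * w_target, and each a_i then changes by
    lr * (w_i . w_target), i.e. lr times the cosine similarity for unit vectors.
    """
    w_t = features[target]
    return {name: lr * dot(w, w_t) for name, w in features.items()}


# Hypothetical feature directions (illustrative only, not taken from any model).
features = {
    "target": unit([1.0, 0.0, 0.0]),
    "nearby_harmful": unit([0.8, 0.6, 0.0]),  # cosine 0.8 with the target
    "distant": unit([0.0, 0.0, 1.0]),         # orthogonal to the target
}

shifts = activation_shifts(features, "target", lr=0.1)
for name, delta in shifts.items():
    print(f"{name}: {delta:+.3f}")
```

Under these assumptions the nearby feature receives 80% of the boost given to the target, while the orthogonal feature is unchanged, which is the qualitative pattern the abstract attributes to overlapping feature directions.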
Anthology ID:
2026.acl-long.1402
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
30385–30414
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1402/
Cite (ACL):
Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. 2026. Understanding Emergent Misalignment via Feature Superposition Geometry. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30385–30414, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Understanding Emergent Misalignment via Feature Superposition Geometry (Minegishi et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1402.pdf
Checklist:
2026.acl-long.1402.checklist.pdf