Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

Sarah Ball; Frauke Kreuter; Nina Panickssery

Abstract

Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. This paper aims to deepen our understanding of how different jailbreak types circumvent safeguards by analyzing model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other, semantically dissimilar classes. This suggests that diverse jailbreaks may exploit a common internal mechanism. We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model’s perception of prompt harmfulness. These insights pave the way for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.

Anthology ID: 2026.eacl-long.12
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 250–279
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.12/
Cite (ACL): Sarah Ball, Frauke Kreuter, and Nina Panickssery. 2026. Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–279, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models (Ball et al., EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.12.pdf
