Dmitrii Usynin


2026

This paper investigates privacy jailbreaking in large language models (LLMs) via steering, examining whether targeted manipulation of internal activations can circumvent alignment mechanisms and alter model behaviour on privacy-sensitive queries, such as those concerning the sexual orientation of public figures. Our approach begins by identifying attention heads that are predictive of refusal behaviour for a given private attribute, using lightweight linear probes trained on labels provided by a privacy evaluator. We then apply steering to a carefully selected subset of these heads, guided by the probe outputs, to induce positive responses from the model. Empirical results demonstrate that these steered responses frequently reveal the target attribute, as well as additional personal information about the data subject, including life events, relationships, and biographical details. Evaluations across three LLMs show that steering achieves disclosure rates of at least 80%, with several responses containing real personal information. This controlled study highlights a concrete privacy risk: personal information memorised during pre-training can be extracted through targeted activation-level interventions, without relying on computationally intensive adversarial prompting techniques.