Dmitrii Usynin

2026

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka | Xue Jiang | Dmitrii Usynin | Xuebing Zhou
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)

This paper investigates privacy jailbreaking in large language models (LLMs) via activation steering, examining whether targeted manipulation of internal activations can circumvent alignment mechanisms and alter model behaviour on privacy-sensitive queries, such as those concerning the sexual orientation of public figures. Our approach begins by identifying attention heads predictive of refusal behaviour for a given private attribute, using lightweight linear probes trained on labels provided by a privacy evaluator. We then apply steering to a carefully selected subset of these heads, guided by the probe outputs, to induce positive responses from the model. Empirical results demonstrate that these steered responses frequently reveal the target attribute, as well as additional personal information about the data subject, including life events, relationships, and biographical details. Evaluations across three LLMs show that steering achieves disclosure rates of at least 80%, with several responses containing real personal information. This controlled study highlights a concrete privacy risk: personal information memorised during pre-training can be extracted through targeted activation-level interventions, without reliance on computationally intensive adversarial prompting techniques.
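The probe-then-steer pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it runs on synthetic random arrays standing in for per-head activations, all function names are hypothetical, and the mean-difference steering direction is an assumption (the abstract says only that steering is "guided by the probe outputs"). Label 1 stands for a response the privacy evaluator marks as disclosing, label 0 for a refusal.

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=200):
    """Fit a logistic-regression probe by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def select_heads(acts, labels, k):
    """acts: (n_samples, n_heads, d_head). Rank heads by probe accuracy."""
    scores = []
    for h in range(acts.shape[1]):
        X = acts[:, h, :]
        w, b = train_probe(X, labels)
        preds = ((X @ w + b) > 0).astype(int)
        scores.append(float((preds == labels).mean()))
    return np.argsort(scores)[::-1][:k], np.array(scores)

def steering_directions(acts, labels, heads):
    """Assumed direction: unit-normalised mean(disclose) - mean(refuse)."""
    dirs = {}
    for h in heads:
        d = acts[labels == 1, h, :].mean(0) - acts[labels == 0, h, :].mean(0)
        dirs[int(h)] = d / np.linalg.norm(d)
    return dirs

def steer(head_out, dirs, alpha):
    """head_out: (n_heads, d_head). Shift only the selected heads."""
    out = head_out.copy()
    for h, d in dirs.items():
        out[h] += alpha * d
    return out

# Synthetic stand-in data: heads 2 and 5 carry the refusal signal.
rng = np.random.default_rng(0)
n, n_heads, d_head = 200, 8, 16
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, n_heads, d_head))
for h in (2, 5):
    acts[labels == 1, h, 0] += 4.0

heads, scores = select_heads(acts, labels, k=2)
dirs = steering_directions(acts, labels, heads)
head_out = rng.normal(size=(n_heads, d_head))
steered = steer(head_out, dirs, alpha=5.0)
```

In a real model the arrays would come from hooks on attention-head outputs, and the steering strength (here the hypothetical parameter alpha) would need tuning per model; neither detail is specified in the abstract.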

