PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin, Xuebing Zhou


Abstract
This paper investigates privacy jailbreaking in large language models (LLMs) via activation steering, examining whether targeted manipulation of internal activations can circumvent alignment mechanisms and alter model behaviour on privacy-sensitive queries, such as those concerning the sexual orientation of public figures. Our approach begins by identifying attention heads that are predictive of refusal behaviour for a given private attribute, using lightweight linear probes trained on labels provided by a privacy evaluator. Guided by the probe outputs, we then apply steering to a carefully selected subset of these heads to induce positive responses from the model. Empirical results demonstrate that these steered responses frequently reveal the target attribute, as well as additional personal information about the data subject, including life events, relationships, and biographical details. Evaluations across three LLMs show that steering achieves disclosure rates of at least 80%, with several responses containing real personal information. This controlled study highlights a concrete privacy risk: personal information memorised during pre-training can be extracted through targeted activation-level interventions, without relying on computationally intensive adversarial prompting techniques.
Anthology ID:
2026.trustnlp-main.16
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
272–286
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.16/
Cite (ACL):
Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin, and Xuebing Zhou. 2026. PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 272–286, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage (Nakka et al., TrustNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.16.pdf