PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin, Xuebing Zhou
Abstract
This paper investigates privacy jailbreaking in large language models (LLMs) via activation steering, examining whether targeted manipulation of internal activations can circumvent alignment mechanisms and alter model behaviour on privacy-sensitive queries, such as those concerning the sexual orientation of public figures. Our approach begins by identifying attention heads predictive of refusal behaviour for a given private attribute, using lightweight linear probes trained on labels provided by a privacy evaluator. We then apply steering to a carefully selected subset of these heads, guided by the probe outputs, to induce positive responses from the model. Empirical results demonstrate that these steered responses frequently reveal the target attribute, as well as additional personal information about the data subject, including life events, relationships, and biographical details. Evaluations across three LLMs show that steering achieves disclosure rates of at least 80%, with several responses containing real personal information. This controlled study highlights a concrete privacy risk: personal information memorised during pre-training can be extracted through targeted activation-level interventions, without reliance on computationally intensive adversarial prompting techniques.
- Anthology ID:
- 2026.trustnlp-main.16
- Volume:
- Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California
- Editors:
- Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
- Venues:
- TrustNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 272–286
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.16/
- Cite (ACL):
- Krishna Kanth Nakka, Xue Jiang, Dmitrii Usynin, and Xuebing Zhou. 2026. PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 272–286, San Diego, California. Association for Computational Linguistics.
- Cite (Informal):
- PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage (Nakka et al., TrustNLP 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.16.pdf
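
The probe-then-steer procedure summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a generic recipe in which one logistic-regression probe is trained per attention head, the heads with the highest probe accuracy are selected, and each selected head's output is shifted along its probe direction. All function names, the steering strength `alpha`, and the synthetic activations and labels (1 = the model answers, 0 = the model refuses) are hypothetical stand-ins for real model activations and privacy-evaluator labels.

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (w, b) by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose thresholded probe output matches the label."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def select_heads(acts, labels, k):
    """acts has shape (n_examples, n_heads, head_dim).

    Returns the indices of the k heads with the highest probe accuracy and
    a unit-norm probe direction for every head. Accuracy is measured on the
    training data for brevity; a held-out split would be used in practice.
    """
    scores, directions = [], []
    for h in range(acts.shape[1]):
        w, b = train_probe(acts[:, h, :], labels)
        scores.append(probe_accuracy(acts[:, h, :], labels, w, b))
        directions.append(w / (np.linalg.norm(w) + 1e-8))
    top = np.argsort(scores)[::-1][:k]
    return top, np.stack(directions)

def steer(head_out, heads, directions, alpha):
    """Shift the selected heads of head_out (n_heads, head_dim) along their probe directions."""
    steered = head_out.copy()
    for h in heads:
        steered[h] += alpha * directions[h]
    return steered

# Synthetic demo: 8 heads of width 16; only head 3 carries the label signal.
rng = np.random.default_rng(0)
n, n_heads, head_dim = 200, 8, 16
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, n_heads, head_dim))
acts[:, 3, 0] += 3.0 * (2 * labels - 1)

top, directions = select_heads(acts, labels, k=1)
alpha = 4.0  # hypothetical steering strength
x = acts[0]
steered = steer(x, top, directions, alpha)
```

In a real setting the shift would be applied to the attention-head outputs during generation (for example through a forward hook), and the choice of `k` and `alpha` would trade off disclosure rate against fluency; neither value here is taken from the paper.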