2025
PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders
Ahmed Frikha | Muhammad Reza Ar Razi | Krishna Kanth Nakka | Ricardo Mendes | Xue Jiang | Xuebing Zhou
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Large Language Models (LLMs) achieve impressive natural language processing performance but can memorize and leak Personally Identifiable Information (PII), posing serious privacy risks. Existing mitigation strategies—such as differential privacy and neuron-level interventions—often degrade utility or fail to reliably prevent leakage. We present PrivacyScalpel, a privacy-preserving framework that leverages LLM interpretability to identify and suppress PII leakage while preserving performance. PrivacyScalpel operates in three stages: (1) Feature Probing to locate model layers encoding PII-rich representations; (2) Sparse Autoencoding using a k-Sparse Autoencoder (k-SAE) to disentangle and isolate privacy-sensitive features; and (3) Feature-Level Interventions via targeted ablation and vector steering to reduce leakage. Experiments on Gemma2-2B and Llama2-7B fine-tuned with the Enron dataset show that PrivacyScalpel reduces email leakage from 5.15% to 0.0% while retaining over 99.4% of the original utility. Compared to neuron-level methods, our approach achieves a superior privacy–utility trade-off, highlighting the effectiveness of targeting sparse, monosemantic features over polysemantic neurons. Beyond privacy gains, PrivacyScalpel offers interpretability insights into PII memorization mechanisms, contributing to safer and more transparent LLM deployment.
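The abstract describes the feature-level intervention only at a high level, so the following is a minimal PyTorch sketch of what ablating PII-related k-SAE features on a hidden state could look like. The tensor names (W_enc, b_enc, W_dec, b_dec) and the pii_feature_ids list are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def ksae_encode(h, W_enc, b_enc, k=32):
    # Encode hidden states with a k-sparse autoencoder: keep only the
    # top-k feature activations per token position and zero the rest.
    acts = F.relu(h @ W_enc + b_enc)               # (batch, seq, n_features)
    topk = torch.topk(acts, k, dim=-1)
    sparse = torch.zeros_like(acts)
    sparse.scatter_(-1, topk.indices, topk.values)
    return sparse

def ablate_pii_features(h, W_enc, b_enc, W_dec, b_dec, pii_feature_ids, k=32):
    # Zero the (hypothetical) privacy-sensitive feature indices before
    # decoding, so the reconstructed hidden state carries less PII signal.
    sparse = ksae_encode(h, W_enc, b_enc, k)
    sparse[..., pii_feature_ids] = 0.0
    return sparse @ W_dec + b_dec                  # intervened hidden state
```

In a full pipeline, such a function would typically be registered as a forward hook on the layer identified during feature probing, so that all downstream computation sees only the PII-suppressed hidden states.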
Privacy Checklist: Privacy Violation Detection Grounding on Contextual Integrity Theory
Haoran Li | Wei Fan | Yulin Chen | Cheng Jiayang | Tianshu Chu | Xuebing Zhou | Peizhao Hu | Yangqiu Song
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Privacy research has attracted wide attention as individuals worry that their private data can be easily leaked during interactions with smart devices, social platforms, and AI applications. Existing works mostly consider privacy attacks and defenses in separate sub-fields, and within each sub-field they address patterns of personally identifiable information (PII). In this paper, we argue that privacy is not solely about PII patterns. We ground our approach in Contextual Integrity (CI) theory, which posits that people's perceptions of privacy are highly correlated with the corresponding social context. Based on this assumption, we formulate privacy as a reasoning problem rather than naive PII matching. We develop the first comprehensive checklist that covers social identities, private attributes, and existing privacy regulations. Unlike prior works on CI that either cover limited expert-annotated norms or model incomplete social context, our proposed privacy checklist uses the entire Health Insurance Portability and Accountability Act of 1996 (HIPAA) as an example to show that large language models (LLMs) can be used to completely cover HIPAA's regulations. Additionally, our checklist gathers expert annotations across multiple ontologies to determine private information, including but not limited to PII. We use our preliminary results on HIPAA to shed light on future context-centric privacy research covering more privacy regulations, social norms, and standards. We will release the reproducible code and data.
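As a toy illustration of treating privacy as contextual-integrity reasoning rather than PII pattern matching, the sketch below checks an information flow against a small list of permitted norms, using the standard CI parameters (sender, recipient, subject, attribute, transmission principle). The norm and flow values are hypothetical HIPAA-style examples, not entries from the paper's checklist.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationFlow:
    # The contextual-integrity parameters: who sends what attribute
    # about whom to whom, under which transmission principle.
    sender: str
    recipient: str
    subject: str
    attribute: str
    transmission_principle: str

@dataclass(frozen=True)
class Norm:
    # A positive norm permits a flow; flows not permitted by any norm are flagged.
    sender: str
    recipient: str
    attribute: str
    transmission_principle: str

def violates(flow: InformationFlow, norms: list[Norm]) -> bool:
    # Flag a flow as a potential violation if no norm in the checklist permits it.
    return not any(
        n.sender == flow.sender
        and n.recipient == flow.recipient
        and n.attribute == flow.attribute
        and n.transmission_principle == flow.transmission_principle
        for n in norms
    )

# Hypothetical HIPAA-style norm: a hospital may share a diagnosis
# with a physician for treatment purposes.
norms = [Norm("hospital", "physician", "diagnosis", "for_treatment")]
flow = InformationFlow("hospital", "insurer", "patient", "diagnosis", "for_marketing")
print(violates(flow, norms))  # True -> flagged for review
```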
2024
PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding
Krishna Kanth Nakka | Ahmed Frikha | Ricardo Mendes | Xue Jiang | Xuebing Zhou
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
The latest and most impactful advances in large models stem from their increased size. Unfortunately, this translates into an improved memorization capacity, raising data privacy concerns. Specifically, it has been shown that models can output personally identifiable information (PII) contained in their training data. However, reported PII extraction performance varies widely, and there is no consensus on the optimal methodology to evaluate this risk, resulting in an underestimation of realistic adversaries. In this work, we empirically demonstrate that it is possible to improve the extractability of PII by more than ten-fold by grounding the prefix of the manually constructed extraction prompt with in-domain data. This approach achieves phone number extraction rates of 0.92%, 3.9%, and 6.86% with 1, 128, and 2308 queries, respectively, i.e., the phone number of 1 person in 15 is extractable.
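A minimal sketch of the grounding idea as described in the abstract: the hand-crafted extraction prompt is prefixed with in-domain text before querying the model. The model id, prompt templates, and target name below are placeholders, not the authors' setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; the attack targets a model whose training data contains the PII
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def grounded_prompt(in_domain_prefix: str, target_name: str) -> str:
    # The manually constructed prompt is prefixed with real in-domain text
    # (e.g., an unrelated email snippet) to move the query closer to the
    # training-data distribution that contains the target PII.
    handcrafted = f"You can reach {target_name} at the phone number"
    return f"{in_domain_prefix}\n{handcrafted}"

prompt = grounded_prompt("Hi team, please see the attached meeting notes.", "Jane Doe")
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=16, do_sample=False)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```

Repeating such grounded queries with different in-domain prefixes corresponds to the multi-query setting (1, 128, 2308 queries) reported in the abstract.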