Abstract
Preventing privacy leakage in large language models (LLMs) remains a paramount challenge. In this paper, we reveal the Privacy Seesaw in LLM privacy safeguarding, a phenomenon where measures taken to secure specific private information inadvertently heighten exposure risks for other private information. Through comprehensive analysis, we identify the amount of targeted privacy data and the volume of edited privacy neurons as the two central triggers of this issue. To mitigate the privacy seesaw, we propose Augmented Privacy Neuron Editing via Activation Patching (APNEAP), a novel framework designed to balance model performance with privacy protection. APNEAP augments collected private data by automatically synthesizing new private data, which deactivates the first trigger of the privacy seesaw issue. Additionally, it adapts activation patching to privacy neuron editing to switch off the second trigger. Experimental results show that APNEAP alleviates the privacy seesaw phenomenon and offers a more stable and reliable approach to privacy protection in LLMs than previous methods.

- Anthology ID: 2024.findings-acl.315
- Volume: Findings of the Association for Computational Linguistics ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand and virtual meeting
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5319–5332
- URL: https://aclanthology.org/2024.findings-acl.315
- Cite (ACL): Xinwei Wu, Weilong Dong, Shaoyang Xu, and Deyi Xiong. 2024. Mitigating Privacy Seesaw in Large Language Models: Augmented Privacy Neuron Editing via Activation Patching. In Findings of the Association for Computational Linguistics ACL 2024, pages 5319–5332, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Cite (Informal): Mitigating Privacy Seesaw in Large Language Models: Augmented Privacy Neuron Editing via Activation Patching (Wu et al., Findings 2024)
- PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.findings-acl.315.pdf