Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel; Cornelius Emde; Seong Joon Oh; Sangdoo Yun; Martin Gubri

Abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a “silent failure” because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

Anthology ID: 2026.acl-long.400
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8870–8892
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.400/
Cite (ACL): Anmol Goel, Cornelius Emde, Seong Joon Oh, Sangdoo Yun, and Martin Gubri. 2026. Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8870–8892, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models (Goel et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.400.pdf
Checklist: 2026.acl-long.400.checklist.pdf
