Controlled Generation for Private Synthetic Text

Zihao Zhao, Anjalie Field


Abstract
Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
Anthology ID:
2025.emnlp-main.1663
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
32708–32723
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1663/
Cite (ACL):
Zihao Zhao and Anjalie Field. 2025. Controlled Generation for Private Synthetic Text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32708–32723, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Controlled Generation for Private Synthetic Text (Zhao & Field, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1663.pdf
Checklist:
2025.emnlp-main.1663.checklist.pdf