Beyond Reconstruction: Generating Privacy-Preserving Clinical Letters

Libo Ren, Samuel Belkadi, Lifeng Han, Warren Del-Pinto, Goran Nenadic


Abstract
Due to the sensitive nature of clinical letters, their use in model training, medical research, and education is limited. This work aims to generate diverse, de-identified, high-quality synthetic clinical letters to enhance privacy protection. We explore several pre-trained language models (PLMs) for text masking and generation under different masking strategies, with a focus on Bio_ClinicalBERT. Evaluation combines qualitative and quantitative methods, supplemented by a downstream Named Entity Recognition (NER) task. Our results indicate that encoder-only models outperform encoder-decoder models. General-domain and clinical-domain PLMs exhibit comparable performance when clinical information is preserved. Preserving clinical entities and document structure yields better performance than fine-tuning alone. Masking stopwords enhances text quality, whereas masking nouns or verbs has a negative impact. BERTScore proves to be the most reliable quantitative evaluation metric in our task. Contextual information has minimal impact, indicating that synthetic letters can effectively replace original ones in downstream tasks. Unlike previous studies that focus primarily on reconstructing original letters or on training a privacy-detection and substitution model, this project provides a framework for generating diverse clinical letters with privacy detection embedded in the process, enabling the expansion of sensitive datasets and facilitating the use of real-world clinical data. Our code and trained models will be made publicly available at https://github.com/HECTA-UoM/Synthetic4Health.
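The masking idea summarized in the abstract (mask stopwords, keep clinical entities intact) can be illustrated with a minimal sketch. This is not the authors' implementation: the stopword list, the `mask_stopwords` helper, the `preserved` entity set, and the example sentence are all hypothetical and chosen only for illustration. In a full pipeline, the `[MASK]` positions would then be filled by a masked language model such as Bio_ClinicalBERT, which is omitted here.

```python
import random

# Assumption: a tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"the", "a", "an", "of", "to", "was", "with", "and", "for", "on"}


def mask_stopwords(tokens, preserved, ratio=1.0, mask_token="[MASK]", seed=0):
    """Mask a fraction of stopword tokens, never touching preserved tokens.

    tokens    -- list of whitespace-split tokens
    preserved -- lowercase tokens that must stay unchanged (e.g. clinical entities)
    ratio     -- fraction of eligible stopwords to mask, between 0.0 and 1.0
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok.lower() in STOPWORDS and tok.lower() not in preserved
    ]
    n_mask = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, n_mask))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


# Hypothetical example sentence and entity set.
tokens = "The patient was admitted with chest pain and treated with aspirin".split()
preserved = {"chest", "pain", "aspirin"}
masked = mask_stopwords(tokens, preserved, ratio=1.0)
print(" ".join(masked))
```

With `ratio=1.0` every eligible stopword is masked while the preserved entity tokens pass through unchanged; lowering `ratio` masks only a seeded random subset, which is one simple way to vary how much of the original wording survives.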
Anthology ID:
2025.privatenlp-main.6
Volume:
Proceedings of the Sixth Workshop on Privacy in Natural Language Processing
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Ivan Habernal, Sepideh Ghanavati, Vijayanta Jain, Timour Igamberdiev, Shomir Wilson
Venues:
PrivateNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
60–74
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.privatenlp-main.6/
Cite (ACL):
Libo Ren, Samuel Belkadi, Lifeng Han, Warren Del-Pinto, and Goran Nenadic. 2025. Beyond Reconstruction: Generating Privacy-Preserving Clinical Letters. In Proceedings of the Sixth Workshop on Privacy in Natural Language Processing, pages 60–74, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Beyond Reconstruction: Generating Privacy-Preserving Clinical Letters (Ren et al., PrivateNLP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.privatenlp-main.6.pdf