Data-Constrained Synthesis of Training Data for De-Identification

Thomas Vakili; Aron Henriksson; Hercules Dalianis

Abstract

Many sensitive domains, such as the clinical domain, lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based named entity recognition (NER) models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. The effectiveness of the process instead depends almost entirely on the performance of the machine-annotating NER models trained using the original data.
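The flow described in the abstract (generate synthetic text, machine-annotate it, collect the result as a training corpus) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: `toy_generator` replaces the domain-adapted LLM with a fixed cycle of invented sentences, `toy_annotator` replaces the encoder-based NER model with a lexicon lookup, and the tag names are assumptions chosen for illustration.

```python
from itertools import cycle
from typing import Callable, Iterator, List, Tuple

TaggedDoc = List[Tuple[str, str]]  # (token, tag) pairs


def toy_generator() -> Iterator[str]:
    """Hypothetical stand-in for a domain-adapted LLM: cycles through invented sentences."""
    return cycle([
        "Patient Anna Berg was admitted on Monday",
        "Follow-up with Dr Lund in Uppsala",
    ])


def toy_annotator(text: str) -> TaggedDoc:
    """Hypothetical stand-in for an NER model trained on the original data."""
    lexicon = {"Anna": "NAME", "Berg": "NAME", "Lund": "NAME", "Uppsala": "LOCATION"}
    # Tokens outside the lexicon get the "O" (outside) tag.
    return [(token, lexicon.get(token, "O")) for token in text.split()]


def synthesize_corpus(
    texts: Iterator[str],
    annotate: Callable[[str], TaggedDoc],
    n_docs: int,
) -> List[TaggedDoc]:
    """Draw n_docs synthetic texts and machine-annotate each one."""
    return [annotate(next(texts)) for _ in range(n_docs)]


# The resulting corpus would then serve as training data for a new NER model.
synthetic_corpus = synthesize_corpus(toy_generator(), toy_annotator, n_docs=3)
```

In this sketch the quality of the corpus labels is bounded by the annotator: any entity that `toy_annotator` misses is silently tagged `O`, which mirrors the abstract's point that the process depends heavily on the machine-annotating model.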

Anthology ID: 2025.acl-long.1329
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 27414–27427
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1329/
Cite (ACL): Thomas Vakili, Aron Henriksson, and Hercules Dalianis. 2025. Data-Constrained Synthesis of Training Data for De-Identification. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27414–27427, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Data-Constrained Synthesis of Training Data for De-Identification (Vakili et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1329.pdf
