With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting

Stephen Meisenbacher, Florian Matthes


Abstract
Recent work in Differential Privacy with Natural Language Processing (DP NLP) has proposed numerous promising techniques in the form of *text rewriting* mechanisms. In the evaluation of these mechanisms, an often-ignored aspect is that of *dataset size*, or rather, the effect of dataset size on a mechanism’s efficacy for utility and privacy preservation. In this work, we are the first to introduce this factor in the evaluation of DP text privatization, where we design utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, and we focus on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; additionally, these findings call for more rigorous evaluation procedures in DP NLP, as well as shed light on the future of DP NLP in practice and at scale.
Anthology ID:
2025.ijcnlp-long.3
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:
IJCNLP | AACL
SIG:
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Note:
Pages:
37–47
Language:
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.3/
DOI:
Bibkey:
Cite (ACL):
Stephen Meisenbacher and Florian Matthes. 2025. With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 37–47, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting (Meisenbacher & Matthes, IJCNLP-AACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.3.pdf