Maulik Chevli

2025

pdf bib abs
Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees
Stephen Meisenbacher | Maulik Chevli | Florian Matthes
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Many works at the intersection of Differential Privacy (DP) in Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under *local* DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter 𝜀. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high 𝜀 values. Addressing this challenge, we introduce **DP-ST**, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the *divide-and-conquer* paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a *privatization neighborhood*. When combined with LLM post-processing, our method allows for coherent text generation even at lower 𝜀 values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable 𝜀 levels.

pdf bib abs
On the Impact of Noise in Differentially Private Text Rewriting
Stephen Meisenbacher | Maulik Chevli | Florian Matthes
Findings of the Association for Computational Linguistics: NAACL 2025

The field of text privatization often leverages the notion of *Differential Privacy* (DP) to provide formal guarantees in the rewriting or obfuscation of sensitive textual data. A common and nearly ubiquitous form of DP application necessitates the addition of calibrated noise to vector representations of text, either at the data- or model-level, which is governed by the privacy parameter 𝜀. However, noise addition almost undoubtedly leads to considerable utility loss, thereby highlighting one major drawback of DP in NLP. In this work, we introduce a new sentence infilling privatization technique, and we use this method to explore the effect of noise in DP text rewriting. We empirically demonstrate that non-DP privatization techniques excel in utility preservation and can find an acceptable empirical privacy-utility trade-off, yet cannot outperform DP methods in empirical privacy protections. Our results highlight the significant impact of noise in current DP rewriting mechanisms, leading to a discussion of the merits and challenges of DP in NLP as well as the opportunities that non-DP methods present.

2024

pdf bib
DP-MLM: Differentially Private Text Rewriting Using Masked Language Models
Stephen Meisenbacher | Maulik Chevli | Juraj Vladika | Florian Matthes
Findings of the Association for Computational Linguistics: ACL 2024

pdf bib abs
A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy
Stephen Meisenbacher | Maulik Chevli | Florian Matthes
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing

Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic level on which a proposed mechanism operates, often taking the form of *word-level* or *document-level* privatization. Recently, several word-level *Metric* Differential Privacy approaches have been proposed, which rely on this generalized DP notion for operating in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence- or document-level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating *between* the word and sentence levels, namely with *collocations*. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.

Co-authors

Venues

Fix author