Pulkit Madaan
2026
Multi-Token Completion for Text Anonymization
Pulkit Madaan | Krithika Ramesh | Lisa Bauer | Charith Peris | Anjalie Field
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Text anonymization is a critical task for enabling research and development in high-stakes domains containing private data, such as medicine, law, and social services. While much research has focused on redacting sensitive content from text, substantially less work has addressed what to replace redacted content with, a choice that can enhance privacy and becomes increasingly important at greater levels of redaction. In this work, we formulate predicting replacements for sensitive spans as a research task with principled, use-inspired evaluation criteria. We further propose a multi-token completion method for this task that is designed to preserve consistency with low compute requirements, enabling practitioners to anonymize data locally before sharing it externally. Human and automated annotations demonstrate that our approach produces more realistic text and better preserves utility than alternative infilling methods and differentially private mechanisms across multiple domains, without retraining. Overall, our work explores the under-studied question of what to replace redacted content with and contributes grounded evaluations capturing utility, facilitating future work.
2024
Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains
Krithika Ramesh | Nupoor Gandhi | Pulkit Madaan | Lisa Bauer | Charith Peris | Anjalie Field
Findings of the Association for Computational Linguistics: EMNLP 2024
The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work, we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data sharing.
2020
Multilingual Neural Machine Translation involving Indian Languages
Pulkit Madaan | Fatiha Sadat
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation
Neural Machine Translation (NMT) models translate a single language pair and require a new model for each additional pair. Multilingual NMT models can translate multiple language pairs, even pairs they have not seen during training. The limited availability of parallel sentences is a well-known problem in machine translation. A multilingual NMT model leverages information from all of its languages to improve itself and performs better. We propose a data augmentation technique that improves this model substantially, yielding a gain of more than 15 BLEU points over the multilingual NMT model. A BLEU score of 36.2 was achieved for Sindhi–English translation, higher than any score on the leaderboard of the LoResMT Shared Task at MT Summit 2019, which provided the data for the experiments.