2025
RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification
Praphul Singh | Charlotte Dzialo | Jangwon Kim | Sumana Srivatsa | Irfan Bulu | Sri Gadde | Krishnaram Kenthapadi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiency, which restricts their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR, for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule- and LLM-based methods, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitution of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, the de-identification and relexicalization methodology, and the modular architecture of RedactOR, as well as its integration with the Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI-driven healthcare data pipelines.
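To make the relexicalization idea concrete, the sketch below shows one way consistent surrogate substitution could be implemented: every occurrence of the same protected entity retrieves the same replacement, keeping de-identified records internally coherent. This is a minimal illustration under stated assumptions, not the paper's implementation; all names here (`SurrogatePool`, `Relexicalizer`) and the hash-based retrieval scheme are hypothetical.

```python
# Hypothetical sketch of retrieval-based entity relexicalization.
# Not the RedactOR API: class names and pools are illustrative only.
import hashlib

class SurrogatePool:
    """Fixed pools of replacement values per entity type."""
    POOLS = {
        "PERSON": ["Alex Morgan", "Jamie Lee", "Riley Chen"],
        "HOSPITAL": ["Lakeside Medical Center", "Northgate Clinic"],
        "CITY": ["Springfield", "Riverton"],
    }

    def pick(self, entity_type: str, surface: str) -> str:
        pool = self.POOLS.get(entity_type)
        if not pool:  # fall back to a generic placeholder
            return f"[{entity_type}]"
        # Deterministic index: the same surface form always retrieves
        # the same surrogate, even across documents and runs.
        h = int(hashlib.sha256(surface.lower().encode()).hexdigest(), 16)
        return pool[h % len(pool)]

class Relexicalizer:
    """Caches (entity type, surface form) -> surrogate mappings."""
    def __init__(self):
        self.pool = SurrogatePool()
        self.cache: dict[tuple[str, str], str] = {}

    def substitute(self, entity_type: str, surface: str) -> str:
        key = (entity_type, surface.lower())
        if key not in self.cache:
            self.cache[key] = self.pool.pick(entity_type, surface)
        return self.cache[key]

relex = Relexicalizer()
# "John Smith" maps to the same surrogate wherever it appears.
assert relex.substitute("PERSON", "John Smith") == relex.substitute("PERSON", "john smith")
```

The essential property is determinism: because retrieval is keyed on the entity's surface form, repeated mentions of a patient or facility stay consistent across a record, which is what preserves coherence for downstream applications.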
Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen | Philip Arthur | Qianyu Feng | Cong Duy Vu Hoang | Yu-Heng Hong | Mahdi Kazemi Moghaddam | Omid Nezami | Duc Thien Nguyen | Gioacchino Tangari | Duy Vu | Thanh Vu | Mark Johnson | Krishnaram Kenthapadi | Don Dharmasiri | Long Duong | Yuan-Fang Li
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers, both for their practical applications and as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted in this context and shown to be highly effective. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
2021
On the Lack of Robust Interpretability of Neural Text Classifiers
Muhammad Bilal Zafar | Michele Donini | Dylan Slack | Cedric Archambeau | Sanjiv Das | Krishnaram Kenthapadi
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2019
What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes
Alexey Romanov | Maria De-Arteaga | Hanna Wallach | Jennifer Chayes | Christian Borgs | Alexandra Chouldechova | Sahin Geyik | Krishnaram Kenthapadi | Anna Rumshisky | Adam Kalai
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
There is a growing body of work that proposes methods for mitigating bias in machine learning systems. These methods typically rely on access to protected attributes such as race, gender, or age. However, this raises two significant challenges: (1) protected attributes may not be available or it may not be legal to use them, and (2) it is often desirable to simultaneously consider multiple protected attributes, as well as their intersections. In the context of mitigating bias in occupation classification, we propose a method for discouraging correlation between the predicted probability of an individual’s true occupation and a word embedding of their name. This method leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes. Crucially, it only requires access to individuals’ names at training time and not at deployment time. We evaluate two variations of our proposed method using a large-scale dataset of online biographies. We find that both variations simultaneously reduce race and gender biases, with almost no reduction in the classifier’s overall true positive rate.
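The core idea, discouraging correlation between the predicted probability of the true occupation and the name embedding, can be illustrated with a covariance-style penalty added to the classification loss. The sketch below is a minimal illustration, not the authors' code; `covariance_penalty`, `training_loss`, and `lambda_cov` are names assumed here for exposition.

```python
# Minimal sketch (assumed names, not the paper's implementation) of a
# penalty that discourages correlation between the predicted probability
# of the true occupation and each dimension of the name embedding.
import torch
import torch.nn.functional as F

def covariance_penalty(p_true: torch.Tensor, name_emb: torch.Tensor) -> torch.Tensor:
    """
    p_true:   (batch,) predicted probability of each person's true occupation
    name_emb: (batch, d) word embedding of each person's name
    Returns the norm of the empirical covariance between p_true and name_emb.
    """
    p_centered = p_true - p_true.mean()
    e_centered = name_emb - name_emb.mean(dim=0, keepdim=True)
    # (d,) covariance of the true-class probability with each embedding dim
    cov = (p_centered.unsqueeze(1) * e_centered).mean(dim=0)
    return cov.norm(p=2)

def training_loss(logits, labels, name_emb, lambda_cov=1.0):
    # Standard classification loss plus the decorrelation penalty.
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return ce + lambda_cov * covariance_penalty(p_true, name_emb)
```

Because the penalty only consumes name embeddings during training, nothing about names or protected attributes is needed at deployment time, which matches the property the abstract emphasizes.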