Jangwon Kim


2025

pdf bib
RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification
Praphul Singh | Charlotte Dzialo | Jangwon Kim | Sumana Srivatsa | Irfan Bulu | Sri Gadde | Krishnaram Kenthapadi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule and LLM based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, de-identification and relexicalization methodology, and modular architecture of RedactOR and its integration with Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI-driven healthcare data pipelines.

2023

pdf bib
Clinical note section classification on doctor-patient conversations in low-resourced settings
Zhuohao Chen | Jangwon Kim | Yang Liu | Shrikanth Narayanan
Proceedings of the Third Workshop on NLP for Medical Conversations

2022

pdf bib
Leveraging Task Transferability to Meta-learning for Clinical Section Classification with Limited Data
Zhuohao Chen | Jangwon Kim | Ram Bhakta | Mustafa Sir
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Identifying sections is one of the critical components of understanding medical information from unstructured clinical notes and developing assistive technologies for clinical note-writing tasks. Most state-of-the-art text classification systems require thousands of in-domain text data to achieve high performance. However, collecting in-domain and recent clinical note data with section labels is challenging given the high level of privacy and sensitivity. The present paper proposes an algorithmic way to improve the task transferability of meta-learning-based text classification in order to address the issue of low-resource target data. Specifically, we explore how to make the best use of the source dataset and propose a unique task transferability measure named Normalized Negative Conditional Entropy (NNCE). Leveraging the NNCE, we develop strategies for selecting clinical categories and sections from source task data to boost cross-domain meta-learning accuracy. Experimental results show that our task selection strategies improve section classification accuracy significantly compared to meta-learning algorithms.