Sunjae Kwon


2024

pdf
README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP
Zonghai Yao | Nandyala Siddharth Kantu | Guanghao Wei | Hieu Tran | Zhangqi Duan | Sunjae Kwon | Zhichao Yang | Hong Yu
Findings of the Association for Computational Linguistics: EMNLP 2024

The advancement in healthcare has shifted focus toward patient-centric approaches, particularly in self-care and patient education, facilitated by access to Electronic Health Records (EHR). However, medical jargon in EHRs poses significant challenges in patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as the training data for models and leveraged a Retrieval-Augmented Generation method to reduce hallucinations and improve the quality of model outputs. Our extensive automatic and human evaluations demonstrate that open-source mobile-friendly models, when fine-tuned with high-quality data, are capable of matching or even surpassing the performance of state-of-the-art closed-source large language models like ChatGPT. This research represents a significant stride in closing the knowledge gap in patient education and advancing patient-centric healthcare solutions.

pdf
ODD: A Benchmark Dataset for the Natural Language Processing Based Opioid Related Aberrant Behavior Detection
Sunjae Kwon | Xun Wang | Weisong Liu | Emily Druhl | Minhee Sung | Joel Reisman | Wenjun Li | Robert Kerns | William Becker | Hong Yu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients’ EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.

pdf
Blocks Architecture (BloArk): Efficient, Cost-Effective, and Incremental Dataset Architecture for Wikipedia Revision History
Lingxi Li | Zonghai Yao | Sunjae Kwon | Hong Yu
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia

Wikipedia (Wiki) is one of the most widely used and publicly available resources for natural language processing (NLP) applications. Wikipedia Revision History (WikiRevHist) shows the order in which edits were made to any Wiki page since its first modification. While the most up-to-date Wiki has been widely used as a training source, WikiRevHist can also be valuable resources for NLP applications. However, there are insufficient tools available to process WikiRevHist without having substantial computing resources, making additional customization, and spending extra time adapting others’ works. Therefore, we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated works in processing WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and warehouses. On top of that, we build the core data processing pipeline: builder and modifier. The BloArk builder transforms the original WikiRevHist dataset from XML syntax into JSON Lines (JSONL) format for improving the concurrent and storage efficiency. The BloArk modifier takes previously-built warehouses to operate incremental modifications for improving the utilization of existing databases and reducing the cost of reusing others’ works. In the end, BloArk can scale up easily in both processing Wikipedia Revision History and incrementally modifying existing dataset for downstream NLP use cases. The source code, documentations, and example usages are publicly available online and open-sourced under GPL-2.0 license.

pdf
ClinicalMamba: A Generative Clinical Language Model on Longitudinal Clinical Notes
Zhichao Yang | Avijit Mitra | Sunjae Kwon | Hong Yu
Proceedings of the 6th Clinical Natural Language Processing Workshop

The advancement of natural language processing (NLP) systems in healthcare hinges on language models’ ability to interpret the intricate information contained within clinical notes. This process often requires integrating information from various time points in a patient’s medical history. However, most earlier clinical language models were pretrained with a context length limited to roughly one clinical document. In this study, We introduce ClinicalMamba, a specialized version of the Mamba language model, pretrained on a vast corpus of longitudinal clinical notes to address the unique linguistic characteristics and information processing needs of the medical domain. ClinicalMamba models, with 130 million and 2.8 billion parameters, demonstrate superior performance in modeling clinical language across extended text lengths compared to Mamba and other clinical models based on longformer and Llama. With few-shot learning, ClinicalMamba achieves notable benchmarks in speed and performance, outperforming existing clinical language models and large language models like GPT-4 in longitudinal clinical tasks.

pdf
UMass-BioNLP at MEDIQA-M3G 2024: DermPrompt - A Systematic Exploration of Prompt Engineering with GPT-4V for Dermatological Diagnosis
Parth Vashisht | Abhilasha Lodha | Mukta Maddipatla | Zonghai Yao | Avijit Mitra | Zhichao Yang | Sunjae Kwon | Junda Wang | Hong Yu
Proceedings of the 6th Clinical Natural Language Processing Workshop

This paper presents our team’s participation in the MEDIQA-ClinicalNLP 2024 shared task B. We present a novel approach to diagnosing clinical dermatology cases by integrating large multimodal models, specifically leveraging the capabilities of GPT-4V under a retriever and a re-ranker framework. Our investigation reveals that GPT-4V, when used as a retrieval agent, can accurately retrieve the correct skin condition 85% of the time using dermatological images and brief patient histories. Additionally, we empirically show that Naive Chain-of-Thought (CoT) works well for retrieval while Medical Guidelines Grounded CoT is required for accurate dermatological diagnosis. Further, we introduce a Multi-Agent Conversation (MAC) framework and show it’s superior performance and potential over the best CoT strategy. The experiments suggest that using naive CoT for retrieval and multi-agent conversation for critique-based diagnosis, GPT-4V can lead to an early and accurate diagnosis of dermatological conditions. The implications of this work extend to improving diagnostic workflows, supporting dermatological education, and enhancing patient care by providing a scalable, accessible, and accurate diagnostic tool.

2023

pdf
Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information
Sunjae Kwon | Rishabh Garodia | Minhwa Lee | Zhichao Yang | Hong Yu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visual Word Sense Disambiguation (VWSD) is a task to find the image that most accurately depicts the correct sense of the target word for the given context. Previously, image-text matching models often suffered from recognizing polysemous words. This paper introduces an unsupervised VWSD approach that uses gloss information of an external lexical knowledge-base, especially the sense definitions. Specifically, we suggest employing Bayesian inference to incorporate the sense definitions when sense information of the answer is not provided. In addition, to ameliorate the out-of-dictionary (OOD) issue, we propose a context-aware definition generation with GPT-3. Experimental results show that the VWSD performance significantly increased with our Bayesian inference-based approach. In addition, our context-aware definition generation achieved prominent performance improvement in OOD examples exhibiting better performance than the existing definition generation method.

pdf
CR-COPEC: Causal Rationale of Corporate Performance Changes to learn from Financial Reports
Ye Chun | Sunjae Kwon | Kyunghwan Sohn | Nakwon Sung | Junyoup Lee | Byoung Seo | Kevin Compher | Seung-won Hwang | Jaesik Choi
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we introduce CR-COPEC called Causal Rationale of Corporate Performance Changes from financial reports. This is a comprehensive large-scale domain-adaptation causal sentence dataset to detect financial performance changes of corporate. CR-COPEC contributes to two major achievements. First, it detects causal rationale from 10-K annual reports of the U.S. companies, which contain experts’ causal analysis following accounting standards in a formal manner. This dataset can be widely used by both individual investors and analysts as material information resources for investing and decision-making without tremendous effort to read through all the documents. Second, it carefully considers different characteristics which affect the financial performance of companies in twelve industries. As a result, CR-COPEC can distinguish causal sentences in various industries by taking unique narratives in each industry into consideration. We also provide an extensive analysis of how well CR-COPEC dataset is constructed and suited for classifying target sentences as causal ones with respect to industry characteristics.

2022

pdf
MedJEx: A Medical Jargon Extraction Model with Wiki’s Hyperlink Span and Contextualized Masked Language Model Score
Sunjae Kwon | Zonghai Yao | Harmon Jordan | David Levy | Brian Corner | Hong Yu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

This paper proposes a new natural language processing (NLP) application for identifying medical jargon terms potentially difficult for patients to comprehend from electronic health record (EHR) notes. We first present a novel and publicly available dataset with expert-annotated medical jargon terms from 18K+ EHR note sentences (MedJ). Then, we introduce a novel medical jargon extraction (MedJEx) model which has been shown to outperform existing state-of-the-art NLP models. First, MedJEx improved the overall performance when it was trained on an auxiliary Wikipedia hyperlink span dataset, where hyperlink spans provide additional Wikipedia articles to explain the spans (or terms), and then fine-tuned on the annotated MedJ data. Secondly, we found that a contextualized masked language model score was beneficial for detecting domain-specific unfamiliar jargon terms. Moreover, our results show that training on the auxiliary Wikipedia hyperlink span datasets improved six out of eight biomedical named entity recognition benchmark datasets. MedJEx is publicly available.

2021

pdf
The Global Banking Standards QA Dataset (GBS-QA)
Kyunghwan Sohn | Sunjae Kwon | Jaesik Choi
Proceedings of the Third Workshop on Economics and Natural Language Processing

A domain specific question answering (QA) dataset dramatically improves the machine comprehension performance. This paper presents a new Global Banking Standards QA dataset (GBS-QA) in the banking regulation domain. The GBS-QA has three values. First, it contains actual questions from market players and answers from global rule setter, the Basel Committee on Banking Supervision (BCBS) in the middle of creating and revising banking regulations. Second, financial regulation experts analyze and verify pairs of questions and answers in the annotation process. Lastly, the GBS-QA is a totally different dataset with existing datasets in finance and is applicable to stimulate transfer learning research in the banking regulation domain.

2018

pdf
Word Sense Disambiguation Based on Word Similarity Calculation Using Word Vector Representation from a Knowledge-based Graph
Dongsuk O | Sunjae Kwon | Kyungsun Kim | Youngjoong Ko
Proceedings of the 27th International Conference on Computational Linguistics

Word sense disambiguation (WSD) is the task to determine the word sense according to its context. Many existing WSD studies have been using an external knowledge-based unsupervised approach because it has fewer word set constraints than supervised approaches requiring training data. In this paper, we propose a new WSD method to generate the context of an ambiguous word by using similarities between an ambiguous word and words in the input document. In addition, to leverage our WSD method, we further propose a new word similarity calculation method based on the semantic network structure of BabelNet. We evaluate the proposed methods on the SemEval-13 and SemEval-15 for English WSD dataset. Experimental results demonstrate that the proposed WSD method significantly improves the baseline WSD method. Furthermore, our WSD system outperforms the state-of-the-art WSD systems in the Semeval-13 dataset. Finally, it has higher performance than the state-of-the-art unsupervised knowledge-based WSD system in the average performance of both datasets.

2017

pdf
A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains
Juae Kim | Sunjae Kwon | Youngjoong Ko | Jungyun Seo
Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)

Biomedical Named Entity (NE) recognition is a core technique for various works in the biomedical domain. In previous studies, using machine learning algorithm shows better performance than dictionary-based and rule-based approaches because there are too many terminological variations of biomedical NEs and new biomedical NEs are constantly generated. To achieve the high performance with a machine-learning algorithm, good-quality corpora are required. However, it is difficult to obtain the good-quality corpora because an-notating a biomedical corpus for ma-chine-learning is extremely time-consuming and costly. In addition, most previous corpora are insufficient for high-level tasks because they cannot cover various domains. Therefore, we propose a method for generating a large amount of machine-labeled data that covers various domains. To generate a large amount of machine-labeled data, firstly we generate an initial machine-labeled data by using a chunker and MetaMap. The chunker is developed to extract only biomedical NEs with manually annotated data. MetaMap is used to annotate the category of bio-medical NE. Then we apply the self-training approach to bootstrap the performance of initial machine-labeled data. In our experiments, the biomedical NE recognition system that is trained with our proposed machine-labeled data achieves much high performance. As a result, our system outperforms biomedical NE recognition system that using MetaMap only with 26.03%p improvements on F1-score.

2016

pdf
KSAnswer: Question-answering System of Kangwon National University and Sogang University in the 2016 BioASQ Challenge
Hyeon-gu Lee | Minkyoung Kim | Harksoo Kim | Juae Kim | Sunjae Kwon | Jungyun Seo | Yi-reun Kim | Jung-Kyu Choi
Proceedings of the Fourth BioASQ workshop