Yunsoo Kim


2024

pdf
KnowLab_AIMed at MEDIQA-CORR 2024: Chain-of-Though (CoT) prompting strategies for medical error detection and correction
Zhaolong Wu | Abul Hasan | Jinge Wu | Yunsoo Kim | Jason Cheung | Teng Zhang | Honghan Wu
Proceedings of the 6th Clinical Natural Language Processing Workshop

This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of train and validation dataset to infer three CoT prompts by examining error types in the clinical notes. In the second method, we utilise the training dataset to prompt the LLM to deduce reasons about their correctness or incorrectness. The constructed CoTs and reasons are then augmented with ICL examples to solve the tasks of error detection, span identification, and error correction. Finally, we combine the two methods using a rule-based ensemble method. Across the three sub-tasks, our ensemble method achieves a ranking of 3rd for both sub-task 1 and 2, while securing 7th place in sub-task 3 among all submissions.

2023

pdf
Chemical Language Understanding Benchmark
Yunsoo Kim | Hyuk Ko | Jane Lee | Hyun Young Heo | Jinyoung Yang | Sungsoo Lee | Kyu-hwang Lee
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

In this paper, we introduce the benchmark datasets named CLUB (Chemical Language Understanding Benchmark) to facilitate NLP research in the chemical industry. We have 4 datasets consisted of text and token classification tasks. As far as we have recognized, it is one of the first examples of chemical language understanding benchmark datasets consisted of tasks for both patent and literature articles provided by industrial organization. All the datasets are internally made by chemists from scratch. Finally, we evaluate the datasets on the various language models based on BERT and RoBERTa, and demonstrate the model performs better when the domain of the pretrained models are closer to chemistry domain. We provide baselines for our benchmark as 0.8054 in average, and we hope this benchmark is used by many researchers in both industry and academia.