Sun Kim

2025

Although there has been a growing interest among industries in integrating generative LLMs into their services, limited experience and scarcity of resources act as a barrier in launching and servicing large-scale LLM-based services. In this paper, we share our experiences in developing and operating generative AI models within a national-scale search engine, with a specific focus on the sensitiveness of user queries. We propose a taxonomy for sensitive search queries, outline our approaches, and present a comprehensive analysis report on sensitive queries from actual users. We believe that our experiences in launching generative AI search systems can contribute to reducing the barrier in building generative LLM-based services.

pdf bib abs
MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model
Sumin Ha | Jun Hyeong Kim | Yinhua Piao | Changyun Cho | Sun Kim
Findings of the Association for Computational Linguistics: EMNLP 2025

Deciphering molecular meaning in chemistry and biomedicine depends on context — a capability that large language models (LLMs) can enhance by aligning molecular structures with language. However, existing molecule-text models ignore complementary information in different molecular views and rely on single-view representations, limiting molecule structural understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) the aligned spaces differ across views due to inconsistent molecule-text mappings, and (2) existing loss objectives fail to preserve complementary information necessary for finegrained alignment. To enhance LLM’s ability to understand molecular structure, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-querying transformer (MQ-Former). Our approach ensures cross-view consistency while the proposed token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available in https://github.com/sumin124/mv-clam.

2024

Most prior safety research of large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of their use cases. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and safeguard response performance compared to the publicly available LLMs.

2023

pdf bib abs
Clinical Note Owns its Hierarchy: Multi-Level Hypergraph Neural Networks for Patient-Level Representation Learning
Nayeon Kim | Yinhua Piao | Sun Kim
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Leveraging knowledge from electronic health records (EHRs) to predict a patient’s condition is essential to the effective delivery of appropriate care. Clinical notes of patient EHRs contain valuable information from healthcare professionals, but have been underused due to their difficult contents and complex hierarchies. Recently, hypergraph-based methods have been proposed for document classifications. Directly adopting existing hypergraph methods on clinical notes cannot sufficiently utilize the hierarchy information of the patient, which can degrade clinical semantic information by (1) frequent neutral words and (2) hierarchies with imbalanced distribution. Thus, we propose a taxonomy-aware multi-level hypergraph neural network (TM-HGNN), where multi-level hypergraphs assemble useful neutral words with rare keywords via note and taxonomy level hyperedges to retain the clinical semantic information. The constructed patient hypergraphs are fed into hierarchical message passing layers for learning more balanced multi-level knowledge at the note and taxonomy levels. We validate the effectiveness of TM-HGNN by conducting extensive experiments with MIMIC-III dataset on benchmark in-hospital-mortality prediction.

2017

The Precision Medicine Track in BioCre-ative VI aims to bring together the Bi-oNLP community for a novel challenge focused on mining the biomedical litera-ture in search of mutations and protein-protein interactions (PPI). In order to support this track with an effective train-ing dataset with limited curator time, the track organizers carefully reviewed Pub-Med articles from two different sources: curated public PPI databases, and the re-sults of state-of-the-art public text mining tools. We detail here the data collection, manual review and annotation process and describe this training corpus charac-teristics. We also describe a corpus per-formance baseline. This analysis will provide useful information to developers and researchers for comparing and devel-oping innovative text mining approaches for the BioCreative VI challenge and other Precision Medicine related applica-tions.

pdf bib abs
Deep Learning for Biomedical Information Retrieval: Learning Textual Relevance from Click Logs
Sunil Mohan | Nicolas Fiorini | Sun Kim | Zhiyong Lu
Proceedings of the 16th BioNLP Workshop

We describe a Deep Learning approach to modeling the relevance of a document’s text to a query, applied to biomedical literature. Instead of mapping each document and query to a common semantic space, we compute a variable-length difference vector between the query and document which is then passed through a deep convolution stage followed by a deep regression network to produce the estimated probability of the document’s relevance to the query. Despite the small amount of training data, this approach produces a more robust predictor than computing similarities between semantic vector representations of the query and document, and also results in significant improvements over traditional IR text factors. In the future, we plan to explore its application in improving PubMed search.

2016

Sun Kim

Fixing paper assignments

2025

2024

2023

2017

2016

2015

2012

Co-authors

Venues