In various natural language processing (NLP) tasks, fine-tuning Pre-trained Language Models (PLMs) often induces spurious correlations, which degrade performance, particularly on out-of-distribution data. To address this problem, we propose **SALAD** (**S**tructure **A**ware and **L**LM-driven **A**ugmented **D**ata), a novel approach designed to enhance model robustness and generalization by generating structure-aware and counterfactually augmented data for contrastive learning. Our method leverages a tagging-based approach to generate structure-aware positive samples and utilizes large language models (LLMs) to generate counterfactual negative samples with diverse sentence patterns. By applying contrastive learning, *SALAD* enables the model to focus on learning the structural relationships between key sentence components while minimizing its reliance on spurious correlations. We validate our approach through experiments on three tasks: Sentiment Classification, Sexism Detection, and Natural Language Inference. The results demonstrate that *SALAD* not only improves model robustness and performance across different environments but also enhances generalization to out-of-distribution datasets and cross-domain scenarios.
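To make the contrastive objective concrete, below is a minimal PyTorch sketch, not the authors' released implementation. It assumes each anchor sentence embedding is paired with one structure-aware positive and one counterfactual negative embedding, and applies an InfoNCE-style loss; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def salad_contrastive_loss(anchor, positive, negative, temperature=0.1):
    """Hypothetical InfoNCE-style loss: pull the structure-aware positive
    toward the anchor and push the counterfactual negative away.
    All inputs have shape (batch, d_model)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)
    pos_sim = (anchor * positive).sum(-1) / temperature   # (batch,)
    neg_sim = (anchor * negative).sum(-1) / temperature   # (batch,)
    logits = torch.stack([pos_sim, neg_sim], dim=1)       # (batch, 2)
    # The positive always sits at index 0, so the target label is 0.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

Minimizing this loss encourages the encoder to keep label-preserving structural variants close while separating counterfactuals whose label has flipped, which is the mechanism the abstract attributes to contrastive learning over the augmented pairs.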
Code summarization, which aims to automatically generate natural language descriptions from source code, has become an essential task in software development for better program understanding. The Abstract Syntax Tree (AST), which represents the syntax structure of the source code, is helpful when utilized together with the sequence of code tokens to improve the quality of code summaries. Recent works on code summarization attempted to capture the sequential and structural information of the source code, but they paid less attention to the fact that source code consists of multiple code blocks. In this paper, we propose BLOCSUM, BLOck scope-based source Code SUMmarization via shared block representation, which utilizes block-scope information by representing various structures of the code block. We propose a shared block position embedding to effectively represent the structure of code blocks and merge both code and AST. Furthermore, we develop variant ASTs to learn rich information such as block and global dependencies of the source code. To validate our approach, we perform experiments on two real-world datasets, a Java dataset and a Python dataset. We demonstrate the effectiveness of BLOCSUM through various experiments, including ablation studies and a human evaluation.
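As a rough illustration of a shared block position embedding, here is a minimal PyTorch sketch, assuming each token (from either the code sequence or the linearized AST) is annotated with the index of its enclosing code block. The class name, shapes, and the single-embedding-table design are assumptions for illustration, not BLOCSUM's actual code.

```python
import torch
import torch.nn as nn

class SharedBlockPositionEmbedding(nn.Module):
    """Tokens inside the same code block share one learned position vector,
    exposing block-scope structure to the encoder. Because the table is
    shared, code tokens and AST nodes from the same block are aligned."""
    def __init__(self, max_blocks: int, d_model: int):
        super().__init__()
        self.block_embed = nn.Embedding(max_blocks, d_model)

    def forward(self, token_embeds: torch.Tensor, block_ids: torch.Tensor):
        # token_embeds: (batch, seq_len, d_model)
        # block_ids:    (batch, seq_len), index of each token's enclosing block
        return token_embeds + self.block_embed(block_ids)
```

Under this scheme, two tokens are given identical positional signals whenever they belong to the same block, which is one simple way to realize the "shared block representation" the abstract describes.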
As pre-trained models have shown successful performance in programming language processing as well as natural language processing, adversarial attacks on these models have also attracted attention. However, previous works on black-box adversarial attacks generated adversarial examples inefficiently with simple greedy search. They also failed to find better adversarial examples because it was hard to reduce the search space without performance loss. In this paper, we propose TABS, an efficient beam search black-box adversarial attack method. We adopt beam search to find better adversarial examples, and contextual semantic filtering to effectively reduce the search space. Contextual semantic filtering reduces the number of candidate adversarial words by considering the surrounding context and semantic similarity. Our proposed method shows good performance in terms of attack success rate, number of queries, and semantic similarity in attacking models for two tasks: NL code search classification and retrieval.
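The contrast between greedy search and beam search can be sketched as follows. This is a hypothetical skeleton, assuming a `candidates_fn` that returns contextually and semantically filtered word substitutions for a sentence and a `score_fn` that queries the black-box victim model (higher score meaning more adversarial); neither helper corresponds to TABS's released code.

```python
import heapq

def beam_search_attack(sentence, candidates_fn, score_fn,
                       beam_width=5, max_steps=10):
    """Greedy search keeps only the single best substitution per step;
    beam search instead keeps the beam_width best partial adversarial
    examples, exploring more of the (filtered) search space per query."""
    beam = [(score_fn(sentence), sentence)]
    for _ in range(max_steps):
        expansions = []
        for _, sent in beam:
            for variant in candidates_fn(sent):  # pre-filtered substitutions
                expansions.append((score_fn(variant), variant))
        if not expansions:
            break
        # Keep the beam_width variants that hurt the victim model most.
        beam = heapq.nlargest(beam_width, expansions, key=lambda x: x[0])
    best_score, best_sentence = max(beam, key=lambda x: x[0])
    return best_sentence
```

Because `candidates_fn` prunes candidates using context and semantic similarity before any model query, the beam explores only plausible substitutions, which is how filtering can reduce query count without sacrificing attack success rate.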