Junho Lee


2024

pdf bib
AutoAugment Is What You Need: Enhancing Rule-based Augmentation Methods in Low-resource Regimes
Juhwan Choi | Kyohoon Jin | Junho Lee | Sangmin Song | YoungBin Kim
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Text data augmentation is a complex problem due to the discrete nature of sentences. Although rule-based augmentation methods are widely adopted in real-world applications because of their simplicity, they suffer from potential semantic damage. Previous researchers have suggested easy data augmentation with soft labels (softEDA), employing label smoothing to mitigate this problem. However, finding the best factor for each model and dataset is challenging; therefore, using softEDA in real-world applications is still difficult. In this paper, we propose adapting AutoAugment to solve this problem. The experimental results suggest that the proposed method can boost existing augmentation methods and that rule-based methods can enhance cutting-edge pretrained language models. We offer the source code.

pdf
Enhancing Effectiveness and Robustness in a Low-Resource Regime via Decision-Boundary-aware Data Augmentation
Kyohoon Jin | Junho Lee | Juhwan Choi | Sangmin Song | Youngbin Kim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Efforts to leverage deep learning models in low-resource regimes have led to numerous augmentation studies. However, the direct application of methods, such as mixup and cutout, is limited due to the discrete characteristics of the textual data. While methods using pre trained language models have exhibited good efficiency, they require additional considerations for robustness. Inspired by recent studies on decision boundaries, this paper proposes a decision-boundary-aware data augmentation strategy to enhance robustness using pretrained language models. The proposed technique first focuses on shifting the latent features closer to the decision boundary, followed by reconstruction to generate an ambiguous version with a soft label. Additionally, mid-K sampling is suggested to enhance the diversity of the generated sentences. This paper demonstrates the performance of the proposed augmentation strategy compared to other methods through extensive experiments. Furthermore, the ablation study demonstrates the effect of soft labels and mid-K sampling and the extensibility of the method with curriculum data augmentation.

pdf
User Guide for KOTE: Korean Online That-gul Emotions Dataset
Duyoung Jeon | Junho Lee | Cheongtag Kim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Despite the lack of comprehensive exploration of emotional connotations, sentiment analysis, which categorizes data as positive or negative, has been widely employed to identify emotional aspects in texts. Recently, corpora labeled with more than just valence or polarity have been built to surpass this limitation. However, most Korean emotion corpora are limited by their small size and narrow range of emotions covered. In this paper, we introduce the KOTE dataset. The KOTE dataset comprises 50,000 Korean online comments, totaling 250,000 cases, each manually labeled for 43 emotions and NO EMOTION through crowdsourcing. The taxonomy for the 43 emotions was systematically derived through cluster analysis of Korean emotion concepts within the word embedding space. After detailing the development of KOTE, we further discuss the results of fine-tuning, as well as analysis for social discrimination within the corpus.