Kyungwoo Song


2025

LBC: Language-Based-Classifier for Out-Of-Variable Generalization
Kangjun Noh | Baekryun Seong | Hoyoon Byun | Youngjun Choi | Sungjin Song | Kyungwoo Song
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have achieved great success in natural language processing tasks such as response generation. However, their use on tabular data has been limited by their inferior performance compared to traditional machine learning models (TMLs) such as XGBoost. We find that the pre-trained knowledge of LLMs enables them to interpret new variables that appear at test time without additional training, a capability central to the concept of Out-of-Variable (OOV) generalization. Based on these findings, we propose the Language-Based-Classifier (LBC), a classifier that maximizes the benefits of LLMs to outperform TMLs on OOV tasks. LBC employs three key methodological strategies: 1) categorical changes that adjust the data to better fit the model’s understanding, 2) advanced order and indicator that enhance the data representation presented to the model, and 3) a verbalizer that maps logit scores to classes during inference to generate model predictions. These strategies, combined with the pre-trained knowledge of LBC, strengthen the model’s ability to handle OOV tasks effectively. We empirically and theoretically validate the superiority of LBC. LBC is the first study to apply an LLM-based model to OOV tasks. The source code is at https://github.com/ASDASDanonymous/Language-Based-Classifier-forOOVtasks.
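As a rough illustration of the verbalizer step described above, the sketch below serializes a tabular row (including a column unseen at training time) into a prompt and maps an LLM’s logits for answer tokens to class probabilities. The function names, prompt template, and Yes/No verbalizer are assumptions for illustration, not the paper’s released implementation.

```python
# Hypothetical sketch: serialize a tabular row (including an unseen, OOV column)
# into a prompt, then map LLM logits for verbalizer tokens to class probabilities.
# Names (serialize_row, VERBALIZER) are illustrative, not taken from the paper's code.
import numpy as np

VERBALIZER = {"Yes": 1, "No": 0}  # map answer tokens to class labels

def serialize_row(row: dict) -> str:
    """Turn a feature dict into a natural-language question for the LLM."""
    facts = ", ".join(f"{k} is {v}" for k, v in row.items())
    return f"Given that {facts}, is the outcome positive? Answer Yes or No."

def verbalize(token_logits: dict) -> dict:
    """Softmax over the verbalizer tokens only, then map to class probabilities."""
    tokens = list(VERBALIZER)
    logits = np.array([token_logits[t] for t in tokens])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return {VERBALIZER[t]: float(p) for t, p in zip(tokens, probs)}

# Example: a test row containing a column ("smoker") never seen during training.
prompt = serialize_row({"age": 63, "blood pressure": "high", "smoker": "yes"})
print(prompt)
print(verbalize({"Yes": 2.1, "No": 0.4}))  # e.g. {1: 0.85, 0: 0.15}
```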

2024

CED: Comparing Embedding Differences for Detecting Out-of-Distribution and Hallucinated Text
Hakyung Lee | Keon-Hee Park | Hoyoon Byun | Jeyoon Yeom | Jihee Kim | Gyeong-Moon Park | Kyungwoo Song
Findings of the Association for Computational Linguistics: EMNLP 2024

Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety and robustness of models deployed in real-world scenarios. While most studies on OOD detection focus on fine-tuned models trained on in-distribution (ID) data, detecting OOD samples with pre-trained models is also important due to computational limitations and the widespread use of open-source pre-trained models. However, in the same-domain shift setting, the OOD detection performance of pre-trained models is insufficient because both ID and OOD samples originate from the same domain, leading to a high overlap in their embeddings. To address this issue, we introduce CED (Comparing Embedding Differences), a training-free OOD detection technique designed to enhance the distinction between ID and OOD datasets. We theoretically show that auxiliary and oracle samples satisfying certain conditions improve this distinction. Motivated by our theoretical analysis, CED enhances the differentiation by utilizing these specially designed auxiliary and oracle samples. As a result, CED significantly improves the ability of pre-trained models to distinguish between ID and OOD samples in text classification and hallucination detection tasks. Furthermore, we verify that CED is a plug-and-play method compatible with various backbone networks, such as RoBERTa, Llama, and OpenAI Embedding.
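The following is a simplified sketch of the embedding-difference idea: a test embedding is scored by how much closer it sits to auxiliary samples than to oracle (ID-representative) samples. The exact CED scoring rule may differ; the function names and the cosine-based score here are illustrative assumptions.

```python
# Simplified illustration of a training-free, embedding-difference OOD score:
# compare a test embedding's similarity to oracle (ID-representative) samples
# against its similarity to auxiliary samples. The exact CED score differs;
# names and the scoring rule here are assumptions for illustration only.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def ood_score(test_emb, oracle_embs, auxiliary_embs):
    """Higher score -> more likely OOD (closer to auxiliary than to oracle)."""
    sim_oracle = np.mean([cosine(test_emb, o) for o in oracle_embs])
    sim_aux = np.mean([cosine(test_emb, a) for a in auxiliary_embs])
    return sim_aux - sim_oracle

rng = np.random.default_rng(0)
oracle = rng.normal(loc=1.0, size=(8, 16))      # stand-ins for ID-representative embeddings
auxiliary = rng.normal(loc=-1.0, size=(8, 16))  # stand-ins for auxiliary embeddings
id_like = rng.normal(loc=1.0, size=16)
ood_like = rng.normal(loc=-1.0, size=16)
print(ood_score(id_like, oracle, auxiliary) < ood_score(ood_like, oracle, auxiliary))  # True
```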

Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer
Haeji Jung | Changdae Oh | Jooeon Kang | Jimin Sohn | Kyungwoo Song | Jinkyu Kim | David R Mortensen
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align languages in a single latent space to mitigate such gaps, how different input-level representations influence these gaps has not been investigated, particularly with phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between languages, and we revisit the use of phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks covering 12 languages in total. The results show that phonemic representations exhibit higher similarities between languages than orthographic representations, and they consistently outperform a grapheme-based baseline model on relatively low-resource languages. We present quantitative evidence from three cross-lingual tasks that demonstrates the effectiveness of phonemic representations, further justified by a theoretical analysis of the cross-lingual performance gap.
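To make the phonemic-input idea concrete, the sketch below converts cognate words in two languages to IPA and compares character overlap before and after conversion; whether the paper uses Epitran or this exact pipeline is an assumption, and the overlap measure is only a crude proxy for the similarities reported above.

```python
# Sketch of the idea: map orthographic text to phonemic (IPA) strings before
# tokenization, so related languages share more input-level symbols.
# Epitran is used here for grapheme-to-phoneme conversion; whether the paper
# uses this exact tool or pipeline is an assumption.
import epitran

def to_ipa(text: str, lang_code: str) -> str:
    return epitran.Epitran(lang_code).transliterate(text)

def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap of character sets, a crude proxy for representation similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

fr, es = "nation", "nación"                    # French / Spanish cognates
fr_ipa, es_ipa = to_ipa(fr, "fra-Latn"), to_ipa(es, "spa-Latn")
print(fr_ipa, es_ipa)
print("orthographic overlap:", char_overlap(fr, es))
print("phonemic overlap:   ", char_overlap(fr_ipa, es_ipa))
```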

2020

Neutralizing Gender Bias in Word Embeddings with Latent Disentanglement and Counterfactual Generation
Seungjae Shin | Kyungwoo Song | JoonHo Jang | Hyemi Kim | Weonyoung Joo | Il-Chul Moon
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent research demonstrates that word embeddings, trained on human-generated corpora, have strong gender biases in the embedding space, and these biases can lead to discriminatory outcomes in various downstream tasks. Whereas previous methods project word embeddings into a linear subspace for debiasing, we introduce a Latent Disentanglement method with a siamese auto-encoder structure and an adapted gradient reversal layer. Our structure separates the semantic latent information and the gender latent information of a given word into disjoint latent dimensions. Afterwards, we introduce Counterfactual Generation to convert the gender information of words, so that the original and the modified embeddings can produce a gender-neutralized word embedding after geometric alignment regularization, without loss of semantic information. In various quantitative and qualitative debiasing experiments, our method outperforms existing debiasing methods for word embeddings. In addition, our method preserves semantic information during debiasing, minimizing semantic information loss on extrinsic NLP downstream tasks.
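A minimal PyTorch sketch of the adapted gradient reversal layer mentioned above: the forward pass is the identity, while the backward pass negates gradients so the encoder learns semantic latents from which gender is hard to predict. The surrounding siamese auto-encoder and counterfactual generation are omitted, and the module and dimension names are illustrative assumptions.

```python
# Minimal PyTorch sketch of a gradient reversal layer (GRL), the adversarial
# component used to push gender information out of the semantic latent space.
# The full siamese auto-encoder and counterfactual generation are omitted;
# module and dimension names here are illustrative assumptions.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)                       # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None     # reverse (negate) gradients

class GenderAdversary(nn.Module):
    """Predicts gender from the semantic latent; the GRL makes the encoder
    learn semantic features from which gender is hard to predict."""
    def __init__(self, latent_dim=150):
        super().__init__()
        self.clf = nn.Linear(latent_dim, 1)

    def forward(self, semantic_latent):
        reversed_latent = GradReverse.apply(semantic_latent, 1.0)
        return self.clf(reversed_latent)

adversary = GenderAdversary(latent_dim=150)
z_semantic = torch.randn(4, 150, requires_grad=True)
logits = adversary(z_semantic)
logits.sum().backward()   # gradients flowing into z_semantic are negated
print(z_semantic.grad.shape)
```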