Hansaem Kim


2024

Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
ChangSu Choi | Yongbin Jeong | Seoyoon Park | Inho Won | HyeonSeok Lim | SangMin Kim | Yejee Kang | Chanhyuk Yoon | Jaewan Park | Yiseul Lee | HyeJin Lee | Younggyun Hahm | Hansaem Kim | KyungTae Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) are pretrained to predict the next word; however, scaling them up requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, but these often overlook less-resourced languages (LRLs). This study proposes three strategies to enhance the performance of LRLs based on publicly available MLLMs. First, the MLLM vocabulary is expanded with LRL tokens to enhance expressiveness. Second, bilingual data are used for pretraining to align the high- and less-resourced languages. Third, a high-quality, small-scale instruction dataset is constructed and instruction tuning is performed to augment the LRL. The experiments employed the Llama2 model with Korean as the LRL, which was quantitatively evaluated against other LLMs across eight tasks. A qualitative assessment was also performed based on human evaluation and GPT-4. Experimental results showed that our proposed Bllossom model exhibited superior performance in the qualitative analyses compared to previously proposed Korean monolingual models.
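
To illustrate the first strategy, vocabulary expansion of a public MLLM can be sketched with the Hugging Face transformers API. This is a minimal sketch, not the actual Bllossom procedure: the token list below is a toy placeholder, and in practice the new tokens would typically come from a subword vocabulary trained on Korean text, with the new embedding rows learned during the bilingual pretraining of the second strategy.

    # Minimal sketch: add Korean subword tokens to an MLLM's vocabulary
    # and resize the embedding matrix so the new rows can be trained.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Toy placeholder tokens; in practice these come from a subword
    # vocabulary trained on Korean text.
    num_added = tokenizer.add_tokens(["안녕", "하세요", "반갑"])
    print(f"added {num_added} tokens")

    # Newly added embedding rows are randomly initialized and learned
    # during continued (bilingual) pretraining.
    model.resize_token_embeddings(len(tokenizer))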

2022

Proceedings of the 29th International Conference on Computational Linguistics
Nicoletta Calzolari | Chu-Ren Huang | Hansaem Kim | James Pustejovsky | Leo Wanner | Key-Sun Choi | Pum-Mo Ryu | Hsin-Hsi Chen | Lucia Donatelli | Heng Ji | Sadao Kurohashi | Patrizia Paggio | Nianwen Xue | Seokhwan Kim | Younggyun Hahm | Zhong He | Tony Kyungil Lee | Enrico Santus | Francis Bond | Seung-Hoon Na

2020

Crowdsourcing in the Development of a Multilingual FrameNet: A Case Study of Korean FrameNet
Younggyun Hahm | Youngbin Noh | Ji Yoon Han | Tae Hwan Oh | Hyonsu Choe | Hansaem Kim | Key-Sun Choi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Using current methods, the construction of multilingual resources in FrameNet is an expensive and complex task. While crowdsourcing is a viable alternative, it is difficult to include non-native English speakers in such efforts, as they often have difficulty with English-based FrameNet tools. In this work, we investigated cross-lingual issues in crowdsourcing approaches for multilingual FrameNets, specifically in the context of the newly constructed Korean FrameNet. To accomplish this, we evaluated the effectiveness of various crowdsourcing settings in which certain types of information are provided to workers, such as English definitions from FrameNet or translated definitions. We then evaluated whether the crowdsourced results accurately captured the meaning of frames both cross-culturally and cross-linguistically, and found that, when allowed to make intuitive choices, the crowd workers achieved quality comparable to that of trained FrameNet experts (F1 > 0.75). The outcomes of this work are now publicly available as a new release of Korean FrameNet 1.1.

Building Korean Abstract Meaning Representation Corpus
Hyonsu Choe | Jiyoon Han | Hyejin Park | Tae Hwan Oh | Hansaem Kim
Proceedings of the Second International Workshop on Designing Meaning Representations

To explore the potential of sembanking in Korean and ways to represent the meaning of Korean sentences, this paper reports on the process of applying Abstract Meaning Representation (AMR), a semantic representation framework that has been studied in a wide range of languages, to Korean, and on its output: the Korean AMR corpus. The corpus constructed so far comprises 1,253 sentences, with raw texts drawn from the ExoBrain corpus, the product of a state-led R&D project on language AI. This paper also analyzes the results both qualitatively and quantitatively and discusses directions for further development.

Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean
Tae Hwan Oh | Ji Yoon Han | Hyonsu Choe | Seokwon Park | Han He | Jinho D. Choi | Na-Rae Han | Jena D. Hwang | Hansaem Kim
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

In this paper, we first raise important issues regarding the Penn Korean Universal Dependency Treebank (PKT-UD) and address these issues by revising the entire corpus manually, with the aim of producing cleaner UD annotations that are more faithful to Korean grammar. For compatibility with the rest of the UD corpora, we follow the UDv2 guidelines and extensively revise the part-of-speech tags and the dependency relations to reflect morphological features and flexible word-order aspects of Korean. The original and revised versions of PKT-UD are used to train transformer-based parsing models using biaffine attention. The parsing model trained on the revised corpus shows a significant improvement of 3.0% in labeled attachment score over the model trained on the previous corpus. Our error analysis demonstrates that this revision allows the parsing model to learn relations more robustly, reducing several critical errors made by the previous model.
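
For reference, biaffine attention (in the standard Dozat-and-Manning-style formulation; dimensions and names below are illustrative, not those of the paper's model) scores every head-dependent pair with a bilinear term plus a linear bias on the head representation:

    import torch
    import torch.nn as nn

    class BiaffineArcScorer(nn.Module):
        """Scores every (head, dependent) pair: s[i, j] = h_i^T U d_j + b . h_i."""
        def __init__(self, dim: int):
            super().__init__()
            self.U = nn.Parameter(torch.empty(dim, dim))
            self.b = nn.Parameter(torch.zeros(dim))
            nn.init.xavier_uniform_(self.U)

        def forward(self, head, dep):
            # head, dep: (batch, seq_len, dim) token representations
            bilinear = torch.einsum("bid,de,bje->bij", head, self.U, dep)
            bias = torch.einsum("bid,d->bi", head, self.b).unsqueeze(2)
            return bilinear + bias  # (batch, head position i, dependent j)

In the standard formulation, training applies a cross-entropy loss over the head dimension for each dependent, and a second biaffine classifier predicts the dependency label.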

Annotation Issues in Universal Dependencies for Korean and Japanese
Ji Yoon Han | Tae Hwan Oh | Lee Jin | Hansaem Kim
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

To investigate issues that arise in developing Universal Dependencies (UD) treebanks for Korean and Japanese, we begin by addressing the typological characteristics of the two languages. Both Korean and Japanese are agglutinative and head-final, and their principles of word segmentation differ from that of English, which makes it difficult to apply the UD guidelines directly. Following the typological characteristics of the two languages and the issues of UD application, we review the application of the UPOS and DEPREL schemes to the two languages. Annotation principles for AUX, ADJ, DET, ADP, and PART are discussed for the UPOS scheme, and annotation principles for case, aux, iobj, and obl are discussed for the DEPREL scheme.
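
To make the segmentation issue concrete: Korean UD corpora typically treat the space-delimited eojeol as the token, so case particles stay attached to their host word. A simplified CoNLL-U sketch (the annotation choices shown are illustrative, not the paper's final scheme):

    # 나는 밥을 먹었다 "I ate rice" (eojeol-level tokens;
    # simplified columns: ID, FORM, UPOS, HEAD, DEPREL)
    1   나는     PRON   3   nsubj
    2   밥을     NOUN   3   obj
    3   먹었다   VERB   0   root

Here the case particles 는 and 을 remain inside the PRON and NOUN tokens rather than being split off as ADP tokens, which is exactly the kind of decision the UPOS and DEPREL discussion above turns on.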

2019

Copula and Case-Stacking Annotations for Korean AMR
Hyonsu Choe | Jiyoon Han | Hyejin Park | Hansaem Kim
Proceedings of the First International Workshop on Designing Meaning Representations

This paper concerns the application of Abstract Meaning Representation (AMR) to Korean. In particular, it focuses on the copula construction, its negation, and the case-stacking phenomenon associated with it. To address these clearly, we review the :domain annotation scheme from various perspectives and improve the existing annotation guidelines, devising annotation schemes for each issue under the principle of pursuing consistency and efficiency of annotation without distorting the characteristics of Korean.
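
For orientation, in standard AMR the copula construction is headed by the predicate nominal, with the subject attached via :domain, and negation is marked with :polarity -. English examples of the baseline scheme (the paper's refinements concern how this scheme interacts with Korean negation and case stacking):

    "The boy is a doctor."
    (d / doctor
       :domain (b / boy))

    "The boy is not a doctor."
    (d / doctor
       :polarity -
       :domain (b / boy))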

2018

Enhancing Universal Dependencies for Korean
Youngbin Noh | Jiyoon Han | Tae Hwan Oh | Hansaem Kim
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

In this paper, to enhance Universal Dependencies for Korean, we propose a modified method for mapping the Korean part-of-speech (POS) tagset to the Universal part-of-speech (UPOS) tagset. Previous studies suggest that the UPOS mapping raises several issues that influence dependency annotation, particularly regarding the POS of Korean predicates and the distinctions among verbs, adjectives, and the copula.
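
As a concrete picture of what such a mapping involves, the Sejong tagset used by most Korean corpora separates verbs (VV), adjectives (VA), and the copula (VCP). A hypothetical Python fragment (the specific choices below are illustrative, not the paper's proposal):

    # Hypothetical Sejong-to-UPOS mapping fragment. Korean adjectives
    # conjugate like verbs and the copula is a bound morpheme, so VA and
    # VCP are exactly where mapping decisions diverge.
    SEJONG_TO_UPOS = {
        "NNG": "NOUN",   # common noun
        "NNP": "PROPN",  # proper noun
        "VV":  "VERB",   # verb
        "VA":  "ADJ",    # adjective (conjugates like a verb)
        "VCP": "AUX",    # copula 이다, one possible UD treatment
        "MAG": "ADV",    # general adverb
    }

    def to_upos(sejong_tag: str) -> str:
        """Map a Sejong POS tag to UPOS, defaulting to X for unmapped tags."""
        return SEJONG_TO_UPOS.get(sejong_tag, "X")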

2009

Word Segmentation Standard in Chinese, Japanese and Korean
Key-Sun Choi | Hitoshi Isahara | Kyoko Kanzaki | Hansaem Kim | Seok Mun Pak | Maosong Sun
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)