2024
Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
ChangSu Choi | Yongbin Jeong | Seoyoon Park | Inho Won | HyeonSeok Lim | SangMin Kim | Yejee Kang | Chanhyuk Yoon | Jaewan Park | Yiseul Lee | HyeJin Lee | Younggyun Hahm | Hansaem Kim | KyungTae Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, but these models often overlook less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on publicly available MLLMs. First, the MLLM vocabulary was expanded with LRL tokens to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality, small-scale instruction dataset was constructed and instruction tuning was performed to augment the LRL. The experiments employed the Llama2 model with Korean as the LRL, and the result was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
2023
Teddysum at MEDIQA-Chat 2023: an analysis of fine-tuning strategy for long dialog summarization
Yongbin Jeong | Ju-Hyuck Han | Kyung Min Chae | Yousang Cho | Hyunbin Seo | KyungTae Lim | Key-Sun Choi | Younggyun Hahm
Proceedings of the 5th Clinical Natural Language Processing Workshop
In this paper, we describe our system design and the various approaches we attempted for Task B of MEDIQA-Chat 2023. The goal of Task B is to generate a full clinical note from doctor-patient consultation dialogues. This task poses several challenges, such as a lack of training data, handling long dialogue inputs, and generating semi-structured clinical notes with section headers. To address these issues, we conducted various experiments and analyzed their results. We used the DialogLED model, pre-trained on long dialogue data, to handle long inputs, and we further pre-trained on other dialogue datasets to address the lack of training data. We also explored methods such as prompting and contrastive learning for handling sections. This paper provides insights into clinical note generation by analyzing the experimental methods and results, and it suggests future research directions.
2022
Proceedings of the 29th International Conference on Computational Linguistics
Nicoletta Calzolari | Chu-Ren Huang | Hansaem Kim | James Pustejovsky | Leo Wanner | Key-Sun Choi | Pum-Mo Ryu | Hsin-Hsi Chen | Lucia Donatelli | Heng Ji | Sadao Kurohashi | Patrizia Paggio | Nianwen Xue | Seokhwan Kim | Younggyun Hahm | Zhong He | Tony Kyungil Lee | Enrico Santus | Francis Bond | Seung-Hoon Na
2020
Enhancing Quality of Corpus Annotation: Construction of the Multi-Layer Corpus Annotation and Simplified Validation of the Corpus Annotation
Youngbin Noh | Kuntae Kim | Minho Lee | Cheolhun Heo | Yongbin Jeong | Yoosung Jeong | Younggyun Hahm | Taehwan Oh | Hyonsu Choe | Seokwon Park | Jin-Dong Kim | Key-Sun Choi
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
Crowdsourcing in the Development of a Multilingual FrameNet: A Case Study of Korean FrameNet
Younggyun Hahm | Youngbin Noh | Ji Yoon Han | Tae Hwan Oh | Hyonsu Choe | Hansaem Kim | Key-Sun Choi
Proceedings of the Twelfth Language Resources and Evaluation Conference
Using current methods, the construction of multilingual resources in FrameNet is an expensive and complex task. While crowdsourcing is a viable alternative, it is difficult to include non-native English speakers in such efforts as they often have difficulty with English-based FrameNet tools. In this work, we investigated cross-lingual issues in crowdsourcing approaches for multilingual FrameNets, specifically in the context of the newly constructed Korean FrameNet. To accomplish this, we evaluated the effectiveness of various crowdsourcing settings whereby certain types of information are provided to workers, such as English definitions in FrameNet or translated definitions. We then evaluated whether the crowdsourced results accurately captured the meaning of frames both cross-culturally and cross-linguistically, and found that by allowing the crowd workers to make intuitive choices, they achieved a quality comparable to that of trained FrameNet experts (F1 > 0.75). The outcomes of this work are now publicly available as a new release of Korean FrameNet 1.1.
2018
Semi-automatic Korean FrameNet Annotation over KAIST Treebank
Younggyun Hahm | Jiseong Kim | Sunggoo Kwon | Key-Sun Choi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Unsupervised Korean Word Sense Disambiguation using CoreNet
Kijong Han | Sangha Nam | Jiseong Kim | Younggyun Hahm | Key-Sun Choi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Automatic Wordnet Mapping: from CoreNet to Princeton WordNet
Jiseong Kim | Younggyun Hahm | Sunggoo Kwon | Key-Sun Choi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
SRDF: Extracting Lexical Knowledge Graph for Preserving Sentence Meaning
Sangha Nam | GyuHyeon Choi | Younggyun Hahm | Key-Sun Choi
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)
In this paper, we present an open information extraction system, called SRDF, that generates lexical knowledge graphs from unstructured text. In the Semantic Web, knowledge is expressed in RDF triple form, but natural language text consists of multiple relations between arguments. For this reason, we combine open information extraction with reification for full-text extraction, so that our knowledge graph preserves the meaning of each sentence. Our knowledge graph is also designed to be compatible with many existing Semantic Web applications. At the end of this paper, we present experimental results and a Korean template generation module developed using SRDF.
QAF: Frame Semantics-based Question Interpretation
Younggyun Hahm | Sangha Nam | Key-Sun Choi
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)
Natural language questions are interpreted into a sequence of patterns to be matched against instances of patterns in a knowledge base (KB) for answering. A natural language (NL) question answering (QA) system utilizes meaningful patterns matching the syntactic/lexical features between the NL questions and the KB. In most KBs, there are only binary relations in triple form, representing a relation between two entities, or between an entity and a value, using a domain-specific ontology. However, the binary relation representation is not enough to cover the complex information in questions, and the ontology vocabulary sometimes does not cover the lexical meaning in questions. Complex meaning requires a knowledge representation that links the binary relation-type triples in the KB. In this paper, we propose a frame semantics-based semantic parsing approach as KB-independent question pre-processing. We propose requirements for question interpretation from the KBQA perspective, and a query form representation based on our proposed format QAF (Question Answering with Frame Semantics), which is designed to cover these requirements. In QAF, frame semantics serves as a model to represent the complex information in questions and to disambiguate the lexical meaning in questions so that it matches the ontology vocabulary. Our system takes a question as input and outputs a QAF query by assigning the semantic information in the question to its corresponding frame semantic structure using semantic parsing rules.
Korean FrameNet Expansion Based on Projection of Japanese FrameNet
Jeong-uk Kim | Younggyun Hahm | Key-Sun Choi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations
The FrameNet project began at Berkeley in 1997 and is now supported in several countries, reflecting the characteristics of each language. Korean FrameNet was initially generated by converting annotated English sentences into Korean with trained translators. However, the high cost of frame preservation and error revision was a huge burden on further expansion of the FrameNet. This study makes use of the linguistic similarity between Japanese and Korean to increase the Korean FrameNet corpus at low cost. We also suggest adapting PubAnnotation and Korean-friendly valence patterns to FrameNet for increased accessibility.
2014
Named Entity Corpus Construction using Wikipedia and DBpedia Ontology
Younggyun Hahm | Jungyeul Park | Kyungtae Lim | Youngsik Kim | Dosam Hwang | Key-Sun Choi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we propose a novel method to automatically build a named entity (NE) corpus based on the DBpedia ontology. Most named entity recognition (NER) systems require time- and effort-consuming annotation to produce training data; work on NER has thus far been limited to certain languages, such as English, that are generally resource-abundant. As an alternative, we suggest that the NE corpus generated by our proposed method can be used as training data. Our approach uses Wikipedia as raw text and the DBpedia data set for named entity disambiguation. Our method is language-independent and easily applied to the many languages for which Wikipedia and DBpedia are provided. Throughout the paper, we demonstrate that our NE corpus is of comparable quality even to a manually annotated NE corpus.
2012
Korean NLP2RDF Resources
YoungGyun Hahm | KyungTae Lim | Jungyeul Park | Yongun Yoon | Key-Sun Choi
Proceedings of the 10th Workshop on Asian Language Resources