KyungTae Lim

Also published as: Kyungtae Lim

2025

We propose the VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.

pdf bib abs
SCV: Light and Effective Multi-Vector Retrieval with Sequence Compressive Vectors
Cheoneum Park | Seohyeong Jeong | Minsang Kim | KyungTae Lim | Yong-Hun Lee
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Recent advances in language models (LMs) has driven progress in information retrieval (IR), effectively extracting semantically relevant information. However, they face challenges in balancing computational costs with deeper query-document interactions. To tackle this, we present two mechanisms: 1) a light and effective multi-vector retrieval with sequence compression vectors, dubbed SCV and 2) coarse-to-fine vector search. The strengths of SCV stems from its application of span compressive vectors for scoring. By employing a non-linear operation to examine every token in the document, we abstract these into a span-level representation. These vectors effectively reduce the document’s dimensional representation, enabling the model to engage comprehensively with tokens across the entire collection of documents, rather than the subset retrieved by Approximate Nearest Neighbor. Therefore, our framework performs a coarse single vector search during the inference stage and conducts a fine-grained multi-vector search end-to-end. This approach effectively reduces the cost required for search. We empirically show that SCV achieves the fastest latency compared to other state-of-the-art models and can obtain competitive performance on both in-domain and out-of-domain benchmark datasets.

This study explores the integration of automated writing evaluation (AWE) and grammatical error correction (GEC) through multitask learning, demonstrating how combining these distinct tasks can enhance performance in both areas. By leveraging a shared learning framework, we show that models trained jointly on AWE and GEC outperform those trained on each task individually. To support this effort, we introduce a dataset specifically designed for multitask learning using AWE and GEC. Our experiments reveal significant synergies between tasks, leading to improvements in both writing assessment accuracy and error correction precision. This research represents a novel approach for optimizing language learning tools by unifying writing evaluation and correction tasks, offering insights into the potential of multitask learning in educational applications.

The Sejong dictionary dataset offers a valuable resource, providing extensive coverage of morphology, syntax, and semantic representation. This dataset can be utilized to explore linguistic information in greater depth.The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases and their associations with target verbs. This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information, with a particular focus on subcategorization frames. Additionally, it outlines our efforts in mapping this information by aligning subcategorization frames with corresponding illustrative sentence examples.Furthermore, we provide a Python library that would simplify syntactic parsing and semantic role labeling. These tools are intended to assist individuals interested in harnessing the Sejong dictionary dataset to develop applications for Korean language processing.

2024

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.

pdf bib abs
When the Misidentified Adverbial Phrase Functions as a Complement
Yige Chen | Kyuwon Kim | KyungTae Lim | Jungyeul Park | Chulwoo Park
Findings of the Association for Computational Linguistics: EMNLP 2024

This study investigates the predicate-argument structure in Korean language processing. Despite the importance of distinguishing mandatory arguments and optional modifiers in sentences, research in this area has been limited. We introduce a dataset with token-level annotations which labels mandatory and optional elements as complements and adjuncts, respectively. Particularly, we reclassify certain Korean phrases, previously misidentified as adverbial phrases, as complements, addressing misuses of the term adjunct in existing Korean treebanks. Utilizing a Korean dependency treebank, we develop an automatic labeling technique for complements and adjuncts. Experiments using the proposed dataset yield satisfying results, demonstrating that the dataset is trainable and reliable.

pdf bib abs
A Linguistically-Informed Annotation Strategy for Korean Semantic Role Labeling
Yige Chen | KyungTae Lim | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Semantic role labeling is an essential component of semantic and syntactic processing of natural languages, which reveals the predicate-argument structure of the language. Despite its importance, semantic role labeling for the Korean language has not been studied extensively. One notable issue is the lack of uniformity among data annotation strategies across different datasets, which often lack thorough rationales. In this study, we suggest an annotation strategy for Korean semantic role labeling that is in line with the previously proposed linguistic theories as well as the distinct properties of the Korean language. We further propose a simple yet viable conversion strategy from the Sejong verb dictionary to a CoNLL-style dataset for Korean semantic role labeling. Experiment results using a transformer-based sequence labeling model demonstrate the reliability and trainability of the converted dataset.

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.

pdf bib abs
Towards Standardized Annotation and Parsing for Korean FrameNet
Yige Chen | Jae Ihn | KyungTae Lim | Jungyeul Park
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Previous research on Korean FrameNet has produced several datasets that serve as resources for FrameNet parsing in Korean. However, these datasets suffer from the problem that annotations are assigned on the word level, which is not optimally designed based on the agglutinative feature of Korean. To address this issue, we introduce a morphologically enhanced annotation strategy for Korean FrameNet datasets and parsing by leveraging the CoNLL-U format. We present the results of the FrameNet parsers trained on the Korean FrameNet data in the original format and our proposed format, respectively, and further elaborate on the linguistic rationales of our proposed scheme. We suggest the morpheme-based scheme to be the standard of Korean FrameNet data annotation.

2023

In this paper, we introduce the design and various attempts for TaskB of MEDIQA-Chat 2023. The goal of TaskB in MEDIQA-Chat 2023 is to generate full clinical note from doctor-patient consultation dialogues. This task has several challenging issues, such as lack of training data, handling long dialogue inputs, and generating semi-structured clinical note which have section heads. To address these issues, we conducted various experiments and analyzed their results. We utilized the DialogLED model pre-trained on long dialogue data to handle long inputs, and we pre-trained on other dialogue datasets to address the lack of training data. We also attempted methods such as using prompts and contrastive learning for handling sections. This paper provides insights into clinical note generation through analyzing experimental methods and results, and it suggests future research directions.

We present in this work a new Universal Morphology dataset for Korean. Previously, the Korean language has been underrepresented in the field of morphological paradigms amongst hundreds of diverse world languages. Hence, we propose this Universal Morphological paradigms for the Korean language that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. This dataset adopts morphological feature schema from CITATION and CITATION for the Korean language as we extract inflected verb forms from the Sejong morphologically analyzed corpus that is one of the largest annotated corpora for Korean. During the data creation, our methodology also includes investigating the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables and morphemes. Finally, we discuss and describe future perspectives on Korean morphological paradigms and the dataset.

2022

In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically. The effectiveness of the proposed format for Korean dependency parsing is then testified by both statistical and neural models, including UDPipe and Stanza, with our carefully constructed morpheme-based word embedding for Korean. morphUD outperforms parsing results for all Korean UD treebanks, and we also present detailed error analysis.

pdf bib abs
Efficient Multilingual Multi-modal Pre-training through Triple Contrastive Loss
Youhan Lee | KyungTae Lim | Woonhyuk Baek | Byungseok Roh | Saehoon Kim
Proceedings of the 29th International Conference on Computational Linguistics

Learning visual and textual representations in the shared space from web-scale image-text pairs improves the performance of diverse vision-and-language tasks, as well as modality-specific tasks. Many attempts in this framework have been made to connect English-only texts and images, and only a few works have been proposed to extend this framework in multilingual settings with the help of many translation pairs. In this multilingual approach, a typical setup is to use pairs of (image and English-text) and translation pairs. The major limitation of this approach is that the learning signal of aligning visual representation with under-resourced language representation is not strong, achieving a sub-optimal performance of vision-and-language tasks. In this work, we propose a simple yet effective enhancement scheme for previous multilingual multi-modal representation methods by using a limited number of pairs of images and non-English texts. In specific, our scheme fine-tunes a pre-trained multilingual model by minimizing a triplet contrastive loss on triplets of image and two different language texts with the same meaning, improving the connection between images and non-English texts. Experiments confirm that our enhancement strategy achieves performance gains in image-text retrieval, zero-shot image classification, and sentence embedding tasks.

2018

pdf bib
Analyse syntaxique de langues faiblement dotées à partir de plongements de mots multilingues [Syntactic analysis of under-resourced languages from multilingual word embeddings]
KyungTae Lim | Niko Partanen | Thierry Poibeau
Traitement Automatique des Langues, Volume 59, Numéro 3 : Traitement automatique des langues peu dotées [NLP for Under-Resourced Languages]

pdf bib abs
SEx BiST: A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations
KyungTae Lim | Cheoneum Park | Changki Lee | Thierry Poibeau
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th/26 teams, and 78.72 UAS – 2nd/26); remarkably, we achieved the best UAS scores on all the English corpora by applying the three suggested feature representations. Finally, we were also ranked 1st at the optional event extraction task, part of the 2018 Extrinsic Parser Evaluation campaign.

pdf bib
Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian
KyungTae Lim | Niko Partanen | Thierry Poibeau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations
Niko Partanen | Kyungtae Lim | Michael Rießler | Thierry Poibeau
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf bib abs
Affordances in Grounded Language Learning
Stephen McGregor | KyungTae Lim
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

We present a novel methodology involving mappings between different modes of semantic representation. We propose distributional semantic models as a mechanism for representing the kind of world knowledge inherent in the system of abstract symbols characteristic of a sophisticated community of language users. Then, motivated by insight from ecological psychology, we describe a model approximating affordances, by which we mean a language learner’s direct perception of opportunities for action in an environment. We present a preliminary experiment involving mapping between these two representational modalities, and propose that our methodology can become the basis for a cognitively inspired model of grounded language learning.

pdf bib abs
The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen | Rogier Blokland | KyungTae Lim | Thierry Poibeau | Michael Rießler
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

2017

pdf bib abs
A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations
KyungTae Lim | Thierry Poibeau
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present our multilingual dependency parser developed for the CoNLL 2017 UD Shared Task dealing with “Multilingual Parsing from Raw Text to Universal Dependencies”. Our parser extends the monolingual BIST-parser as a multi-source multilingual trainable parser. Thanks to multilingual word embeddings and one hot encodings for languages, our system can use both monolingual and multi-source training. We trained 69 monolingual language models and 13 multilingual models for the shared task. Our multilingual approach making use of different resources yield better results than the monolingual approach for 11 languages. Our system ranked 5 th and achieved 70.93 overall LAS score over the 81 test corpora (macro-averaged LAS F1 score).

2014

pdf bib abs
Named Entity Corpus Construction using Wikipedia and DBpedia Ontology
Younggyun Hahm | Jungyeul Park | Kyungtae Lim | Youngsik Kim | Dosam Hwang | Key-Sun Choi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we propose a novel method to automatically build a named entity corpus based on the DBpedia ontology. Since most of named entity recognition systems require time and effort consuming annotation tasks as training data. Work on NER has thus for been limited on certain languages like English that are resource-abundant in general. As an alternative, we suggest that the NE corpus generated by our proposed method, can be used as training data. Our approach introduces Wikipedia as a raw text and uses the DBpedia data set for named entity disambiguation. Our method is language-independent and easy to be applied to many different languages where Wikipedia and DBpedia are provided. Throughout the paper, we demonstrate that our NE corpus is of comparable quality even to the manually annotated NE corpus.