Youngjoong Ko


2024

pdf
Hyper-QKSG: Framework for Automating Query Generation and Knowledge-Snippet Extraction from Tables and Lists
Dooyoung Kim | Yoonjin Jang | Dongwook Shin | Chanhoon Park | Youngjoong Ko
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

These days, there is an increasing necessity to provide a user with a short knowledge-snippet for a query in commercial information retrieval services such as the featured snippet of Google. In this paper, we focus on how to automatically extract the candidates of query-knowledge snippet pairs from structured HTML documents by using a new Language Model (HTML-PLM). In particular, the proposed system is powerful on extracting them from Tables and Lists, and provides a new framework for automate query generation and knowledge-snippet extraction based on a QA-pair filtering procedure including the snippet refinement and verification processes, which enhance the quality of generated query-knowledge snippet pairs. As a result, 53.8% of the generated knowledge-snippets includes complex HTML structures such as tables and lists in our experiments of a real-world environments, and 66.5% of the knowledge-snippets are evaluated as valid.

pdf
RAC: Retrieval-augmented Conversation Dataset for Open-domain Question Answering in Conversational Settings
Bonggeun Choi | JeongJae Park | Yoonsung Kim | Jaehyun Park | Youngjoong Ko
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

In recent years, significant advancements in conversational question and answering (CQA) have been driven by the exponential growth of large language models and the integration of retrieval mechanisms that leverage external knowledge to generate accurate and contextually relevant responses. Consequently, the fields of conversational search and retrieval-augmented generation (RAG) have obtained substantial attention for their capacity to address two key challenges: query rewriting within conversational histories for better retrieval performance and generating responses by employing retrieved knowledge. However, both fields are often independently studied, and comprehensive study on entire systems remains underexplored. In this work, we present a novel retrieval-augmented conversation (RAC) dataset and develop a baseline system comprising query rewriting, retrieval, reranking, and response generation stages. Experimental results demonstrate the competitiveness of the system and extensive analyses are conducted to apprehend the impact of retrieval results to response generation.

2023

pdf
Contrastively Pretrained Vision-Language Transformers and Domain Adaptation Methods for Multimodal TOD Systems
Youngjae Chang | Doo Young Kim | Jinyoung Kim | Keunha Kim | Hyunmook Cha | Suyoung Min | Youngjoong Ko | Kye-Hwan Lee | Joonwoo Park
Proceedings of The Eleventh Dialog System Technology Challenge

The Situated Interactive MultiModal Conversations (SIMMC2.1) Challenge 2022 is hosted by the Eleventh Dialog System Technology Challenge (DSTC11). This is the third consecutive year multimodal dialog systems have been selected as an official track of the competition, promoted by the continued interest in the research community. The task of SIMMC is to create a shopping assistant agent that can communicate with customers in a virtual store. It requires processing store scenes and product catalogs along with the customer’s request. The task is decomposed into four steps and each becomes a subtask. In this work, we explore the common approaches to modeling multimodality and find the method with the most potential. We also identify a discrepancy in using pretrained language models for dialog tasks and devise a simple domain-adaptation method. Our model came in third place for object coreferencing, dialog state tracking, and response generation tasks.

pdf
Topic-Informed Dialogue Summarization using Topic Distribution and Prompt-based Modeling
Jaeah You | Youngjoong Ko
Findings of the Association for Computational Linguistics: EMNLP 2023

Dealing with multiple topics should be considered an important issue in dialogue summarization, because dialogues, unlike documents, are prone to topic drift. Thus, we propose a new dialogue summarization model that reflects dialogue topic distribution to consider all topics present in the dialogue. First, the distribution of dialogue topics is estimated by an effective topic discovery model. Then topic-informed prompt transfers estimated topic distribution information to the output of encoder and decoder vectors. Finally, the topic extractor estimates the summary topic distribution from the output context vector of decoder to distinguish its difference from the dialogue topic distribution. To consider the proportion of each topic distribution appeared in the dialogue, the extractor is trained to reduce the difference between the distributions of the dialogue and the summary. The experimental results on SAMSum and DialogSum show that our model outperforms state-of-the-art methods on ROUGE scores. The human evaluation results also show that our framework well generates comprehensive summaries.

2021

pdf
Graph-based Fake News Detection using a Summarization Technique
Gihwan Kim | Youngjoong Ko
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Nowadays, fake news is spreading in various ways, and this fake information is causing a lot of social damages. Thus the need to detect fake information is increasing to prevent the damages caused by fake news. In this paper, we propose a novel graph-based fake news detection method using a summarization technique that uses only the document internal information. Our proposed method represents the relationship between all sentences using a graph and the reflection rate of contextual information among sentences is computed by using an attention mechanism. In addition, we improve the performance of fake news detection by utilizing summary information as an important subject of the document. The experimental results demonstrate that our method achieves high accuracy, 91.04%, that is 8.85%p better than the previous method.

pdf
Fine-grained Post-training for Improving Retrieval-based Dialogue Systems
Janghoon Han | Taesuk Hong | Byoungjae Kim | Youngjoong Ko | Jungyun Seo
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Retrieval-based dialogue systems display an outstanding performance when pre-trained language models are used, which includes bidirectional encoder representations from transformers (BERT). During the multi-turn response selection, BERT focuses on training the relationship between the context with multiple utterances and the response. However, this method of training is insufficient when considering the relations between each utterance in the context. This leads to a problem of not completely understanding the context flow that is required to select a response. To address this issue, we propose a new fine-grained post-training method that reflects the characteristics of the multi-turn dialogue. Specifically, the model learns the utterance level interactions by training every short context-response pair in a dialogue session. Furthermore, by using a new training objective, the utterance relevance classification, the model understands the semantic relevance and coherence between the dialogue utterances. Experimental results show that our model achieves new state-of-the-art with significant margins on three benchmark datasets. This suggests that the fine-grained post-training method is highly effective for the response selection task.

2020

pdf
Multi-Task Learning for Knowledge Graph Completion with Pre-trained Language Models
Bosung Kim | Taesuk Hong | Youngjoong Ko | Jungyun Seo
Proceedings of the 28th International Conference on Computational Linguistics

As research on utilizing human knowledge in natural language processing has attracted considerable attention in recent years, knowledge graph (KG) completion has come into the spotlight. Recently, a new knowledge graph completion method using a pre-trained language model, such as KG-BERT, is presented and showed high performance. However, its scores in ranking metrics such as Hits@k are still behind state-of-the-art models. We claim that there are two main reasons: 1) failure in sufficiently learning relational information in knowledge graphs, and 2) difficulty in picking out the correct answer from lexically similar candidates. In this paper, we propose an effective multi-task learning method to overcome the limitations of previous works. By combining relation prediction and relevance ranking tasks with our target link prediction, the proposed model can learn more relational properties in KGs and properly perform even when lexical similarity occurs. Experimental results show that we not only largely improve the ranking performances compared to KG-BERT but also achieve the state-of-the-art performances in Mean Rank and Hits@10 on the WN18RR dataset.

2018

pdf
Word Sense Disambiguation Based on Word Similarity Calculation Using Word Vector Representation from a Knowledge-based Graph
Dongsuk O | Sunjae Kwon | Kyungsun Kim | Youngjoong Ko
Proceedings of the 27th International Conference on Computational Linguistics

Word sense disambiguation (WSD) is the task to determine the word sense according to its context. Many existing WSD studies have been using an external knowledge-based unsupervised approach because it has fewer word set constraints than supervised approaches requiring training data. In this paper, we propose a new WSD method to generate the context of an ambiguous word by using similarities between an ambiguous word and words in the input document. In addition, to leverage our WSD method, we further propose a new word similarity calculation method based on the semantic network structure of BabelNet. We evaluate the proposed methods on the SemEval-13 and SemEval-15 for English WSD dataset. Experimental results demonstrate that the proposed WSD method significantly improves the baseline WSD method. Furthermore, our WSD system outperforms the state-of-the-art WSD systems in the Semeval-13 dataset. Finally, it has higher performance than the state-of-the-art unsupervised knowledge-based WSD system in the average performance of both datasets.

2017

pdf
A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains
Juae Kim | Sunjae Kwon | Youngjoong Ko | Jungyun Seo
Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)

Biomedical Named Entity (NE) recognition is a core technique for various works in the biomedical domain. In previous studies, using machine learning algorithm shows better performance than dictionary-based and rule-based approaches because there are too many terminological variations of biomedical NEs and new biomedical NEs are constantly generated. To achieve the high performance with a machine-learning algorithm, good-quality corpora are required. However, it is difficult to obtain the good-quality corpora because an-notating a biomedical corpus for ma-chine-learning is extremely time-consuming and costly. In addition, most previous corpora are insufficient for high-level tasks because they cannot cover various domains. Therefore, we propose a method for generating a large amount of machine-labeled data that covers various domains. To generate a large amount of machine-labeled data, firstly we generate an initial machine-labeled data by using a chunker and MetaMap. The chunker is developed to extract only biomedical NEs with manually annotated data. MetaMap is used to annotate the category of bio-medical NE. Then we apply the self-training approach to bootstrap the performance of initial machine-labeled data. In our experiments, the biomedical NE recognition system that is trained with our proposed machine-labeled data achieves much high performance. As a result, our system outperforms biomedical NE recognition system that using MetaMap only with 26.03%p improvements on F1-score.

2015

pdf
A Simultaneous Recognition Framework for the Spoken Language Understanding Module of Intelligent Personal Assistant Software on Smart Phones
Changsu Lee | Youngjoong Ko | Jungyun Seo
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2011

pdf
Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
Seon Yang | Youngjoong Ko
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2009

pdf
Extracting Comparative Sentences from Korean Text Documents Using Comparative Lexical Patterns and Machine Learning Techniques
Seon Yang | Youngjoong Ko
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2005

pdf
Improving Korean Speech Acts Analysis by Using Shrinkage and Discourse Stack
Kyungsun Kim | Youngjoong Ko | Jungyun Seo
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf
Learning with Unlabeled Data for Text Categorization Using a Bootstrapping and a Feature Projection Technique
Youngjoong Ko | Jungyun Seo
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2002

pdf
Text Categorization using Feature Projections
Youngjoong Ko | Jungyun Seo
COLING 2002: The 19th International Conference on Computational Linguistics

pdf
Automatic Text Categorization using the Importance of Sentences
Youngjoong Ko | Jinwoo Park | Jungyun Seo
COLING 2002: The 19th International Conference on Computational Linguistics

2000

pdf
Automatic Text Categorization by Unsupervised Learning
Youngjoong Ko | Jungyun Seo
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics