Yongil Kim


2023

Injecting Comparison Skills in Task-Oriented Dialogue Systems for Database Search Results Disambiguation
Yongil Kim | Yerin Hwang | Joongbo Shin | Hyunkyung Bae | Kyomin Jung
Findings of the Association for Computational Linguistics: ACL 2023

In task-oriented dialogue (TOD) systems designed to help users accomplish specific goals in one or more domains, the agent retrieves entities that satisfy user constraints from the database. However, when multiple database search results exist, ambiguity arises over which results to select and present to the user. Existing TOD systems handle this ambiguity by randomly selecting one or a few results and presenting their names to the user. In a real scenario, however, users do not always accept a randomly recommended entity, and they should have access to more comprehensive information about the search results. To address this limitation, we propose a novel task called Comparison-Based database search Ambiguity handling (CBA), which handles ambiguity in database search results by comparing the properties of multiple entities so that users can choose according to their preferences. Accordingly, we introduce a new framework for automatically collecting high-quality dialogue data, along with the Disambiguating Schema-guided Dialogue (DSD) dataset, an augmented version of the SGD dataset. Experimental studies on the DSD dataset demonstrate that training baseline models with the dataset effectively addresses the CBA task. Our dataset and code will be released publicly.
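A rough illustration of the CBA setting (the entity fields, values, and response wording below are invented for this sketch and are not drawn from the DSD dataset): instead of naming one randomly chosen result, the agent can compare the retrieved entities attribute by attribute so the user can decide.

# Hypothetical comparison-based disambiguation turn; all names are illustrative.
results = [
    {"name": "Cafe Roma", "price": "moderate", "rating": 4.2, "area": "centre"},
    {"name": "Blue Door", "price": "cheap", "rating": 4.5, "area": "north"},
]

def compare_response(entities, attrs=("price", "rating", "area")):
    # Describe each entity by the attributes that distinguish it.
    parts = []
    for e in entities:
        desc = ", ".join(f"{a}: {e[a]}" for a in attrs)
        parts.append(f"{e['name']} ({desc})")
    return "I found " + " and ".join(parts) + ". Which would you prefer?"

print(compare_response(results))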

2022

Modality Alignment between Deep Representations for Effective Video-and-Language Learning
Hyeongu Yun | Yongil Kim | Kyomin Jung
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Video-and-Language learning, such as video question answering or video captioning, is the next challenge for the deep learning community, as it pursues how human intelligence perceives everyday life. These tasks require multi-modal reasoning, i.e., handling both visual and textual information simultaneously across time. From this point of view, a cross-modality attention module that fuses video and text representations plays a critical role in most recent approaches. However, existing Video-and-Language models merely compute the attention weights without considering the different characteristics of the video and text modalities. Such a naïve attention module prevents current models from fully exploiting the strength of cross-modality attention. In this paper, we propose a novel Modality Alignment method that benefits the cross-modality attention module by guiding it to easily amalgamate multiple modalities. Specifically, we exploit Centered Kernel Alignment (CKA), which was originally proposed to measure the similarity between two deep representations. Our method directly optimizes CKA to align the video and text embedding representations, thereby helping the cross-modality attention module combine information across modalities. Experiments on real-world Video QA tasks demonstrate that our method significantly outperforms conventional multi-modal methods, with a +3.57% accuracy improvement over the baseline on a popular benchmark dataset. Additionally, in a synthetic data environment, we show that learning the alignment with our method boosts the performance of the cross-modality attention.
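For intuition, the linear form of CKA between two centered representation matrices X and Y is ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) (Kornblith et al., 2019). A minimal PyTorch sketch of using CKA as an alignment objective follows; the function names and the exact loss form (1 - CKA) are assumptions for illustration, not the paper's verified implementation.

import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x: (n, d1) video embeddings, y: (n, d2) text embeddings for n paired samples.
    # Center each feature dimension across the batch.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.norm(y.t() @ x, p="fro") ** 2
    return cross / (torch.norm(x.t() @ x, p="fro") * torch.norm(y.t() @ y, p="fro"))

def alignment_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Maximizing CKA aligns the two modalities, so minimize (1 - CKA),
    # e.g., added to the task loss during training.
    return 1.0 - linear_cka(video_emb, text_emb)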