Hwanhee Lee


2024

pdf
FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Joonho Yang | Seunghyun Yoon | ByeongJeong Kim | Hwanhee Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Through the advent of pre-trained language models, there have been notable advancements in abstractive summarization systems. Simultaneously, a considerable number of novel methods for evaluating factual consistency in abstractive summarization systems has been developed. But these evaluation approaches incorporate substantial limitations, especially on refinement and interpretability. In this work, we propose highly effective and interpretable factual inconsistency detection method FIZZ (Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document) for abstractive summarization systems that is based on fine-grained atomic facts decomposition. Moreover, we align atomic facts decomposed from the summary with the source document through adaptive granularity expansion. These atomic facts represent a more fine-grained unit of information, facilitating detailed understanding and interpretability of the summary’s factual inconsistency. Experimental results demonstrate that our proposed factual consistency checking system significantly outperforms existing systems. We release the code at https://github.com/plm3332/FIZZ.

pdf
IterCQR: Iterative Conversational Query Reformulation with Retrieval Guidance
Yunah Jang | Kang-il Lee | Hyunkyung Bae | Hwanhee Lee | Kyomin Jung
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Conversational search aims to retrieve passages containing essential information to answer queries in a multi-turn conversation. In conversational search, reformulating context-dependent conversational queries into stand-alone forms is imperative to effectively utilize off-the-shelf retrievers. Previous methodologies for conversational query reformulation frequently depend on human-annotated rewrites.However, these manually crafted queries often result in sub-optimal retrieval performance and require high collection costs.To address these challenges, we propose **Iter**ative **C**onversational **Q**uery **R**eformulation (**IterCQR**), a methodology that conducts query reformulation without relying on human rewrites. IterCQR iteratively trains the conversational query reformulation (CQR) model by directly leveraging information retrieval (IR) signals as a reward.Our IterCQR training guides the CQR model such that generated queries contain necessary information from the previous dialogue context.Our proposed method shows state-of-the-art performance on two widely-used datasets, demonstrating its effectiveness on both sparse and dense retrievers. Moreover, IterCQR exhibits superior performance in challenging settings such as generalization on unseen datasets and low-resource scenarios.

pdf
LifeTox: Unveiling Implicit Toxicity in Life Advice
Minbeom Kim | Jahyun Koo | Hwanhee Lee | Joonsuk Park | Hwaran Lee | Kyomin Jung
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

As large language models become increasingly integrated into daily life, detecting implicit toxicity across diverse contexts is crucial. To this end, we introduce LifeTox, a dataset designed for identifying implicit toxicity within a broad range of advice-seeking scenarios. Unlike existing safety datasets, LifeTox comprises diverse contexts derived from personal experiences through open-ended questions. Our experiments demonstrate that RoBERTa fine-tuned on LifeTox matches or surpasses the zero-shot performance of large language models in toxicity classification tasks. These results underscore the efficacy of LifeTox in addressing the complex challenges inherent in implicit toxicity. We open-sourced the dataset and the LifeTox moderator family; 350M, 7B, and 13B.

pdf
Low-Resource Cross-Lingual Summarization through Few-Shot Learning with Large Language Models
Gyutae Park | Seojin Hwang | Hwanhee Lee
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

Cross-lingual summarization (XLS) aims to generate a summary in a target language different from the source language document. While large language models (LLMs) have shown promising zero-shot XLS performance, their few-shot capabilities on this task remain unexplored, especially for low-resource languages with limited parallel data. In this paper, we investigate the few-shot XLS performance of various models, including Mistral-7B-Instruct-v0.2, GPT-3.5, and GPT-4.Our experiments demonstrate that few-shot learning significantly improves the XLS performance of LLMs, particularly GPT-3.5 and GPT-4, in low-resource settings. However, the open-source model Mistral-7B-Instruct-v0.2 struggles to adapt effectively to the XLS task with limited examples. Our findings highlight the potential of few-shot learning for improving XLS performance and the need for further research in designing LLM architectures and pre-training objectives tailored for this task. We provide a future work direction to explore more effective few-shot learning strategies and to investigate the transfer learning capabilities of LLMs for cross-lingual summarization.

pdf
KoCoSa: Korean Context-aware Sarcasm Detection Dataset
Yumin Kim | Heejae Suh | Mingi Kim | Dongyeon Won | Hwanhee Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Sarcasm is a way of verbal irony where someone says the opposite of what they mean, often to ridicule a person, situation, or idea. It is often difficult to detect sarcasm in the dialogue since detecting sarcasm should reflect the context (i.e., dialogue history). In this paper, we introduce a new dataset for the Korean dialogue sarcasm detection task, KoCoSa (Korean Context-aware Sarcasm Detection Dataset), which consists of 12.8K daily Korean dialogues and the labels for this task on the last response. To build the dataset, we propose an efficient sarcasm detection dataset generation pipeline: 1) generating new sarcastic dialogues from source dialogues with large language models, 2) automatic and manual filtering of abnormal and toxic dialogues, and 3) human annotation for the sarcasm detection task. We also provide a simple but effective baseline for the Korean sarcasm detection task trained on our dataset. Experimental results on the dataset show that our baseline system outperforms strong baselines like large language models, such as GPT-3.5, in the Korean sarcasm detection task. We show that the sarcasm detection task relies deeply on the existence of sufficient context. We will release the dataset at https://github.com/Yu-billie/KoCoSa_sarcasm_detection.

pdf
Kosmic: Korean Text Similarity Metric Reflecting Honorific Distinctions
Yerin Hwang | Yongil Kim | Hyunkyung Bae | Jeesoo Bang | Hwanhee Lee | Kyomin Jung
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Existing English-based text similarity measurements primarily focus on the semantic dimension, neglecting the unique linguistic attributes found in languages like Korean, where honorific expressions are explicitly integrated. To address this limitation, this study proposes Kosmic, a novel Korean text-similarity metric that encompasses the semantic and tonal facets of a given text pair. For the evaluation, we introduce a novel benchmark annotated by human experts, empirically showing that Kosmic outperforms the existing method. Moreover, by leveraging Kosmic, we assess various Korean paraphrasing methods to determine which techniques are most effective in preserving semantics and tone.

2023

pdf
Task-specific Compression for Multi-task Language Models using Attribution-based Pruning
Nakyeong Yang | Yunah Jang | Hwanhee Lee | Seohyeong Jeong | Kyomin Jung
Findings of the Association for Computational Linguistics: EACL 2023

Multi-task language models show outstanding performance for various natural language understanding tasks with only a single model. However, these language models inevitably utilize an unnecessarily large number of model parameters, even when used only for a specific task. In this paper, we propose a novel training-free compression method for multi-task language models using pruning method. Specifically, we use an attribution method to determine which neurons are essential for performing a specific task. We task-specifically prune unimportant neurons and leave only task-specific parameters. Furthermore, we extend our method to be applicable in both low-resource and unsupervised settings. Since our compression method is training-free, it uses little computing resources and does not update the pre-trained parameters of language models, reducing storage space usage. Experimental results on the six widely-used datasets show that our proposed pruning method significantly outperforms baseline pruning methods. In addition, we demonstrate that our method preserves performance even in an unseen domain setting.

pdf
Critic-Guided Decoding for Controlled Text Generation
Minbeom Kim | Hwanhee Lee | Kang Min Yoo | Joonsuk Park | Hwaran Lee | Kyomin Jung
Findings of the Association for Computational Linguistics: ACL 2023

Steering language generation towards objectives or away from undesired content has been a long-standing goal in utilizing language models (LM). Recent work has demonstrated reinforcement learning and weighted decoding as effective approaches to achieve a higher level of language control and quality with pros and cons. In this work, we propose a novel critic decoding method for controlled language generation (CriticControl) that combines the strengths of reinforcement learning and weighted decoding. Specifically, we adopt the actor-critic framework and train an LM-steering critic from reward models. Similar to weighted decoding, our method freezes the language model and manipulates the output token distribution using a critic to improve training efficiency and stability. Evaluation of our method on three controlled generation tasks, topic control, sentiment control, and detoxification, shows that our approach generates more coherent and well-controlled texts than previous methods. In addition, CriticControl demonstrates superior generalization ability in zero-shot settings. Human evaluation studies also corroborate our findings.

pdf
Asking Clarification Questions to Handle Ambiguity in Open-Domain QA
Dongryeol Lee | Segwang Kim | Minwoo Lee | Hwanhee Lee | Joonsuk Park | Sang-Woo Lee | Kyomin Jung
Findings of the Association for Computational Linguistics: EMNLP 2023

Ambiguous questions persist in open-domain question answering, because formulating a precise question with a unique answer is often challenging. Previous works have tackled this issue by asking disambiguated questions for all possible interpretations of the ambiguous question. Instead, we propose to ask a clarification question, where the user’s response will help identify the interpretation that best aligns with the user’s intention. We first present CAmbigNQ, a dataset consisting of 5,653 ambiguous questions, each with relevant passages, possible answers, and a clarification question. The clarification questions were efficiently created by generating them using InstructGPT and manually revising them as necessary. We then define a pipeline of three tasks—(1) ambiguity detection, (2) clarification question generation, and (3) clarification-based QA. In the process, we adopt or design appropriate evaluation metrics to facilitate sound research. Lastly, we achieve F1 of 61.3, 25.1, and 40.5 on the three tasks, demonstrating the need for further improvements while providing competitive baselines for future work.

pdf
Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources
Yerin Hwang | Yongil Kim | Hyunkyung Bae | Hwanhee Lee | Jeesoo Bang | Kyomin Jung
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.

2022

pdf
Improving Multiple Documents Grounded Goal-Oriented Dialog Systems via Diverse Knowledge Enhanced Pretrained Language Model
Yunah Jang | Dongryeol Lee | Hyung Joo Park | Taegwan Kang | Hwanhee Lee | Hyunkyung Bae | Kyomin Jung
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

In this paper, we mainly discuss about our submission to MultiDoc2Dial task, which aims to model the goal-oriented dialogues grounded in multiple documents. The proposed task is split into grounding span prediction and agent response generation. The baseline for the task is the retrieval augmented generation model, which consists of a dense passage retrieval model for the retrieval part and the BART model for the generation part. The main challenge of this task is that the system requires a great amount of pre-trained knowledge to generate answers grounded in multiple documents. To overcome this challenge, we adopt model pretraining, fine-tuning, and multi-task learning to enhance our model’s coverage of pretrained knowledge. We experimented with various settings of our method to show the effectiveness of our approaches.

pdf
Factual Error Correction for Abstractive Summaries Using Entity Retrieval
Hwanhee Lee | Cheoneum Park | Seunghyun Yoon | Trung Bui | Franck Dernoncourt | Juae Kim | Kyomin Jung
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Despite the recent advancements in abstractive summarization systems leveraged from large-scale datasets and pre-trained language models, the factual correctness of the summary is still insufficient. One line of trials to mitigate this problem is to include a post-editing process that can detect and correct factual errors in the summary. In building such a system, it is strongly required that 1) the process has a high success rate and interpretability and 2) it has a fast running time. Previous approaches focus on the regeneration of the summary, resulting in low interpretability and high computing resources. In this paper, we propose an efficient factual error correction system RFEC based on entity retrieval. RFEC first retrieves the evidence sentences from the original document by comparing the sentences with the target summary to reduce the length of the text to analyze. Next, RFEC detects entity-level errors in the summaries using the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences. Experimental results show that our proposed error correction system shows more competitive performance than baseline methods in correcting factual errors with a much faster speed.

pdf
Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking
Hwanhee Lee | Kang Min Yoo | Joonsuk Park | Hwaran Lee | Kyomin Jung
Findings of the Association for Computational Linguistics: NAACL 2022

Despite the recent advances in abstractive summarization systems, it is still difficult to determine whether a generated summary is factual consistent with the source text. To this end, the latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries. Luckily, the former is readily available as reference summaries in existing summarization datasets. However, generating the latter remains a challenge, as they need to be factually inconsistent, yet closely relevant to the source text to be effective. In this paper, we propose to generate factually inconsistent summaries using source texts and reference summaries with key information masked. Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using our method generally outperform existing models and show a competitive correlation with human judgments. We also analyze the characteristics of the summaries generated using our method. We will release the pre-trained model and the code at https://github.com/hwanheelee1993/MFMA.

2021

pdf
UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning
Hwanhee Lee | Seunghyun Yoon | Franck Dernoncourt | Trung Bui | Kyomin Jung
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate the image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the previous benchmark dataset (i.e., human annotations) on image captioning metric, and introduce a new collection of human annotations on the generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation than all previous metrics that require multiple references.

pdf
KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
Hwanhee Lee | Seunghyun Yoon | Franck Dernoncourt | Doo Soon Kim | Trung Bui | Joongbo Shin | Kyomin Jung
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to the free-form of the answer. Especially, widely used n-gram similarity metrics often fail to discriminate the incorrect answers since they equally consider all of the tokens. To alleviate this problem, we propose KPQA metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns different weights to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics in various datasets. Code for KPQA-metric will be available at https://github.com/hwanheelee1993/KPQA.

pdf
Learning to Select Question-Relevant Relations for Visual Question Answering
Jaewoong Lee | Heejoon Lee | Hwanhee Lee | Kyomin Jung
Proceedings of the Third Workshop on Multimodal Artificial Intelligence

Previous existing visual question answering (VQA) systems commonly use graph neural networks(GNNs) to extract visual relationships such as semantic relations or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate outputs from multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses question embedding and joint embedding of the encoders to obtain dynamic attention weights with regard to the type of questions. Using the learnable attention weights, the proposed model can efficiently use the necessary visual relation features for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weight and show that the proposed model assigns a higher weight to relations that are more relevant to the question.

pdf
QACE: Asking Questions to Evaluate an Image Caption
Hwanhee Lee | Thomas Scialom | Seunghyun Yoon | Franck Dernoncourt | Kyomin Jung
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper we propose QACE, a new metric based on Question Answering for Caption Evaluation to evaluate image captioning based on Question Generation(QG) and Question Answering(QA) systems. QACE generates questions on the evaluated caption and check its content by asking the questions on either the reference caption or the source image. We first develop QACE_Ref that compares the answers of the evaluated caption to its reference, and report competitive results with the state-of-the-art metrics. To go further, we propose QACE_Img, that asks the questions directly on the image, instead of reference. A Visual-QA system is necessary for QACE_Img. Unfortunately, the standard VQA models are actually framed a classification among only few thousands categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE_Img is multi-modal, reference-less and explainable. Our experiments show that QACE_Img compares favorably w.r.t. other reference-less metrics.

2020

pdf
ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
Hwanhee Lee | Seunghyun Yoon | Franck Dernoncourt | Doo Soon Kim | Trung Bui | Kyomin Jung
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

In this paper, we propose an evaluation metric for image captioning systems using both image and text information. Unlike the previous methods that rely on textual representations in evaluating the caption, our approach uses visiolinguistic representations. The proposed method generates image-conditioned embeddings for each token using ViLBERT from both generated and reference texts. Then, these contextual embeddings from each of the two sentence-pair are compared to compute the similarity score. Experimental results on three benchmark datasets show that our method correlates significantly better with human judgments than all existing metrics.

2018

pdf
AttnConvnet at SemEval-2018 Task 1: Attention-based Convolutional Neural Networks for Multi-label Emotion Classification
Yanghoon Kim | Hwanhee Lee | Kyomin Jung
Proceedings of the 12th International Workshop on Semantic Evaluation

In this paper, we propose an attention-based classifier that predicts multiple emotions of a given sentence. Our model imitates human’s two-step procedure of sentence understanding and it can effectively represent and classify sentences. With emoji-to-meaning preprocessing and extra lexicon utilization, we further improve the model performance. We train and evaluate our model with data provided by SemEval-2018 task 1-5, each sentence of which has several labels among 11 given emotions. Our model achieves 5th/1st rank in English/Spanish respectively.