Seonmin Koo


2025

Semantic Inversion, Identical Replies: Revisiting Negation Blindness in Large Language Models
Jinsung Kim | Seonmin Koo | Heuiseok Lim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) often fail to capture the semantic changes that negation introduces into queries and consequently generate incorrect responses. Negation occurs frequently in real-world language and expresses the opposite or absence of a statement, making it an essential element of logical reasoning. Previous studies have explored LLMs’ ability to capture negation ‘separately’ from their ability to properly ground knowledge for positive queries. However, this perspective is limited in that it cannot clearly distinguish whether incorrect responses are caused by the logical incoherence introduced by negation or by a lack of grounding ability for the given context. To address this issue, we focus on the phenomenon in which a model fails to capture the semantic contradiction in a negated query despite accurately understanding the knowledge required for the corresponding positive query. We term this phenomenon negation blindness on the query. We propose a verification framework that includes task design and measurement methods to verify this issue. In detail, we establish two criteria for systematic task design, i) ‘complexity’ and ii) ‘constrainedness’, and devise four verification tasks accordingly. Moreover, we analyze the results extensively and provide insights into the feasibility of alleviating the problem through experiments with various approaches. Our code and resources can be found at https://www.github.com/jin62304/NegationBlindness.
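A minimal sketch of the core check the abstract describes: query a model with a positive question and its negated counterpart, and flag cases where the replies are effectively identical. The `ask_llm` stub and the exact-match comparison are assumptions for illustration, not the paper's verification framework.

```python
# Sketch: flag "negation blindness" when a model answers a query and its
# negation identically. ask_llm is a placeholder for any real model client.
from typing import Callable

def ask_llm(prompt: str) -> str:
    # Replace with a real chat/completion call; left unimplemented here.
    raise NotImplementedError

def is_negation_blind(positive_query: str, negated_query: str,
                      llm: Callable[[str], str]) -> bool:
    """Return True if the model gives the same answer to a query and its negation."""
    positive_reply = llm(positive_query).strip().lower()
    negated_reply = llm(negated_query).strip().lower()
    # Semantically inverted queries should not yield identical replies.
    return positive_reply == negated_reply

if __name__ == "__main__":
    fake_llm = lambda q: "Paris"  # toy model that ignores negation entirely
    print(is_negation_blind(
        "Which city is the capital of France?",
        "Which city is not the capital of France?",
        fake_llm,
    ))  # True -> the toy model is blind to the negation
```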

LimaCost: Data Valuation for Instruction Tuning of Large Language Models
Hyeonseok Moon | Jaehyung Seo | Seonmin Koo | Jinsung Kim | Young-kyoung Ham | Jiwon Moon | Heuiseok Lim
Findings of the Association for Computational Linguistics: EMNLP 2025

Instruction tuning (IT) is an effective approach for aligning large language models (LLMs) with human intentions. There is ongoing discourse regarding data quality for IT. In an effort to find robust criteria of data quality for IT, we introduce LimaCost, a data quality measure that exhibits a strong correlation with model performance. LimaCost utilizes the LIMA dataset, whose effectiveness for IT has already been validated by several previous works. LimaCost then estimates the value of a given data point by estimating how many LIMA data points would be needed to approximate its gradient. Our experiments reveal that LimaCost enables effective data selection that derives high alignment performance. We demonstrate that selecting data based on high LimaCost proves to be more effective than existing data selection strategies.
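A rough sketch of the stated intuition: value a candidate example by how many LIMA gradients a greedy matching procedure needs before it approximates the candidate's gradient. The random stand-in gradients and the greedy pursuit below are illustrative assumptions, not the paper's actual estimator.

```python
# Sketch: count how many LIMA gradient vectors are needed to approximate a
# candidate example's gradient via greedy matching pursuit.
import numpy as np

def lima_cost(candidate_grad: np.ndarray, lima_grads: np.ndarray,
              tol: float = 0.1, max_steps: int = 50) -> int:
    """Return the number of LIMA gradients used to approximate candidate_grad."""
    residual = candidate_grad.copy()
    used = 0
    for _ in range(max_steps):
        if np.linalg.norm(residual) <= tol * np.linalg.norm(candidate_grad):
            break
        # Pick the LIMA gradient most aligned with the remaining residual.
        scores = lima_grads @ residual
        best = int(np.argmax(np.abs(scores)))
        g = lima_grads[best]
        coef = scores[best] / (g @ g + 1e-12)
        residual = residual - coef * g
        used += 1
    return used

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lima = rng.normal(size=(100, 64))   # stand-in LIMA gradient features
    cand = rng.normal(size=64)          # stand-in candidate gradient
    print("approximate LimaCost:", lima_cost(cand, lima))
```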

HAWK: Highlighting Entity-aware Knowledge for Alleviating Information Sparsity in Long Contexts
Seonmin Koo | Jinsung Kim | Chanjun Park | Heuiseok Lim
Findings of the Association for Computational Linguistics: EMNLP 2025

As the textual context given for various tasks lengthens, necessary information becomes scattered throughout it, making it more difficult for large language models (LLMs) to capture relevant details. This challenge is particularly prominent in tasks such as question answering (QA), where key information is often not evenly distributed within the context. This problem of information sparsity has prompted various approaches, such as direct context adjustment and retrieval-based methods. However, these approaches typically leverage compressed contexts, which increases the risk that key information is contained in the dropped portions. Therefore, research is required that addresses information sparsity without losing key details in the context. To address this issue, we propose the Highlighting entity-AWare Knowledge (HAWK) framework. HAWK consists of three main steps: i) entity extraction, ii) entity-aware subcontext selection, and iii) triplet construction. The core mechanism of HAWK is to highlight key information in a context and structure it in an entity-aware manner, facilitating knowledge-enhanced generation. Through extensive experiments and comprehensive analysis, we confirm that HAWK yields significant improvements in QA tasks with long contexts, achieving up to a 27.6-point F1 score increase and at least an average win rate of 76.75% over existing methods.
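A toy, hedged walk-through of the three steps named in the abstract: extract entities, select entity-bearing subcontexts, and build simple triplets. A real pipeline would use an NER model or an LLM; the regex heuristics and co-occurrence triplets here are placeholders for illustration only, not HAWK's implementation.

```python
# Sketch: entity extraction -> entity-aware subcontext selection -> triplet construction.
import re
from itertools import combinations

STOPWORDS = {"The", "She", "He", "It", "They", "This", "That"}

def extract_entities(context: str) -> set[str]:
    # Placeholder NER: capitalized tokens minus a few pronouns/determiners.
    return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", context)) - STOPWORDS

def select_subcontexts(context: str, entities: set[str]) -> list[str]:
    # Keep only sentences that mention at least one extracted entity.
    sentences = re.split(r"(?<=[.!?])\s+", context)
    return [s for s in sentences if any(e in s for e in entities)]

def build_triplets(subcontexts: list[str], entities: set[str]) -> list[tuple[str, str, str]]:
    # Naive construction: link co-occurring entity pairs via their shared sentence.
    triplets = []
    for sentence in subcontexts:
        present = sorted(e for e in entities if e in sentence)
        for head, tail in combinations(present, 2):
            triplets.append((head, sentence, tail))
    return triplets

if __name__ == "__main__":
    ctx = "Marie Curie was born in Warsaw. She later moved to Paris. The weather was cold."
    ents = extract_entities(ctx)
    subs = select_subcontexts(ctx, ents)
    for t in build_triplets(subs, ents):
        print(t)
```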

2024

PANDA: Persona Attributes Navigation for Detecting and Alleviating Overuse Problem in Large Language Models
Jinsung Kim | Seonmin Koo | Heuiseok Lim
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In the persona-grounded dialogue (PGD) task, a model is required not only to respond fluently but also to properly ground persona attributes according to the current conversation topic. However, due to their tendency to ground the given attributes excessively, LLMs often generate unnatural responses, either by using attributes that deviate from the flow of the conversation or by exploiting too many attributes at once. We term this phenomenon the *overuse* problem of LLMs. Unfortunately, research devising precise criteria and frameworks to quantitatively verify LLMs’ *overuse* problem is clearly insufficient. To address this issue, we propose the **P**ersona **A**ttributes **N**avigation for **D**etecting and **A**lleviating the *overuse* problem (**PANDA**) framework. **PANDA** is the first study to quantify the persona *overuse* problem of LLMs by establishing clear standards for the problem and verifying various LLMs against them. Moreover, the framework guides our understanding of persona attribute usage by introducing diverse and detailed dialogue topics that reflect practical conversation situations. We provide insights into LLMs’ persona attribute *overuse* problem through comprehensive verification and analysis with **PANDA** on the PGD task. Our code and resources can be found at http://github.com/jin62304/PANDA.
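A minimal sketch of the *overuse* intuition described above: count how many of the given persona attributes a single response grounds, and flag responses that pack in too many attributes or ground attributes unrelated to the current topic. The keyword-overlap check and the per-turn threshold are illustrative assumptions, not PANDA's actual criteria.

```python
# Sketch: detect persona attribute "overuse" in a single dialogue turn.
def grounded_attributes(response: str, attributes: list[str]) -> list[str]:
    """Return persona attributes whose content words (len > 3) appear in the response."""
    resp = response.lower()
    used = []
    for attr in attributes:
        content_words = [w for w in attr.lower().split() if len(w) > 3]
        if any(w in resp for w in content_words):
            used.append(attr)
    return used

def is_overuse(response: str, attributes: list[str], topic: str,
               max_per_turn: int = 1) -> bool:
    used = grounded_attributes(response, attributes)
    off_topic = [a for a in used if topic.lower() not in a.lower()]
    # Overuse: too many attributes at once, or grounding attributes unrelated to the topic.
    return len(used) > max_per_turn or bool(off_topic)

if __name__ == "__main__":
    persona = ["I love hiking", "I have two cats", "I work as a chef"]
    reply = "As a chef, I love hiking with my two cats, but anyway, pizza is easy to cook!"
    print(is_overuse(reply, persona, topic="cooking"))  # True: three attributes in one turn
```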

Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts
Seonmin Koo | Jinsung Kim | YoungJoon Jang | Chanjun Park | Heuiseok Lim
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

As the utilization of Large Language Models (LLMs) becomes more widespread, there is a growing demand for their ability to handle more complex and longer external knowledge across various use cases. Most existing evaluations of the open-domain question answering (ODQA) task, which necessitates the use of external knowledge, focus solely on whether the model provides the correct answer. However, even when LLMs answer correctly, they often fail to provide an obvious source for their responses. Therefore, it is necessary to jointly evaluate and verify the correctness of the answers and the appropriateness of the grounded evidence in complex external contexts. To address this issue, we examine the phenomenon of discrepancies in ability across two distinct tasks, QA and evidence selection, when they are performed simultaneously, from the perspective of task alignment. To verify LLMs’ task alignment, we introduce a verification framework and resources that consider both the semantic relevancy and the structural diversity of the given long-context knowledge. Through extensive experiments and detailed analysis, we provide insights into the task misalignment between QA and evidence selection. Our code and resources will be available upon acceptance.
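A small sketch of the joint view described above: score each example on answer correctness and on evidence selection, then report how often the two outcomes agree. The field names, exact-match scoring, and the simple agreement rate are illustrative assumptions, not the paper's framework or metrics.

```python
# Sketch: jointly evaluate QA correctness and evidence selection per example.
from dataclasses import dataclass

@dataclass
class Example:
    predicted_answer: str
    gold_answer: str
    predicted_evidence_id: str
    gold_evidence_id: str

def joint_report(examples: list[Example]) -> dict[str, float]:
    n = len(examples)
    qa_correct = [e.predicted_answer.strip().lower() == e.gold_answer.strip().lower()
                  for e in examples]
    ev_correct = [e.predicted_evidence_id == e.gold_evidence_id for e in examples]
    aligned = [q == v for q, v in zip(qa_correct, ev_correct)]
    return {
        "qa_accuracy": sum(qa_correct) / n,
        "evidence_accuracy": sum(ev_correct) / n,
        "task_alignment_rate": sum(aligned) / n,  # both tasks succeed or fail together
    }

if __name__ == "__main__":
    data = [
        Example("Seoul", "Seoul", "p3", "p3"),   # correct answer, correct evidence
        Example("Seoul", "Seoul", "p1", "p3"),   # correct answer, wrong evidence
        Example("Busan", "Seoul", "p3", "p3"),   # wrong answer, correct evidence
    ]
    print(joint_report(data))
```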

Search if you don’t know! Knowledge-Augmented Korean Grammatical Error Correction with Large Language Models
Seonmin Koo | Jinsung Kim | Chanjun Park | Heuiseok Lim
Findings of the Association for Computational Linguistics: EMNLP 2024

Grammatical error correction (GEC) is a practical task used in the real world, and it has shown strong performance gains alongside the development of large language models (LLMs). However, these gains have been obtained primarily in English, and performance on non-English languages, such as Korean, remains relatively low. We hypothesize that this insufficiency arises because relying solely on the parametric knowledge of LLMs makes it difficult to thoroughly understand the given context in Korean GEC. Therefore, we propose a Knowledge-Augmented GEC (KAGEC) framework that incorporates evidential information from external sources into the prompt for the GEC task. KAGEC first extracts salient phrases from the given source and retrieves non-parametric knowledge based on these phrases, aiming to enhance the context-aware generation capabilities of LLMs. Furthermore, we conduct validations for fine-grained error types to identify those that require a retrieval-augmented approach when LLMs perform Korean GEC. According to the experimental results, most LLMs, including ChatGPT, demonstrate significant performance improvements when KAGEC is applied.
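A hedged sketch of the prompt-construction flow described above: pull salient phrases from the source sentence, retrieve external snippets for them, and prepend the evidence to a GEC prompt. The phrase extractor, the retriever, and the prompt wording are placeholders; KAGEC's actual components are not reproduced here.

```python
# Sketch: build a knowledge-augmented prompt for Korean GEC.
from typing import Callable

def extract_salient_phrases(source: str, top_k: int = 3) -> list[str]:
    # Placeholder salience: longest whitespace-separated tokens (a real system
    # would use keyphrase extraction or an LLM).
    tokens = sorted(set(source.split()), key=len, reverse=True)
    return tokens[:top_k]

def build_kagec_prompt(source: str,
                       retrieve: Callable[[str], str],
                       top_k: int = 3) -> str:
    phrases = extract_salient_phrases(source, top_k)
    evidence = "\n".join(f"- {p}: {retrieve(p)}" for p in phrases)
    return (
        "Correct the grammatical errors in the Korean sentence below.\n"
        f"External knowledge:\n{evidence}\n"
        f"Sentence: {source}\nCorrected:"
    )

if __name__ == "__main__":
    dummy_retrieve = lambda phrase: f"(definition or usage example of '{phrase}')"
    print(build_kagec_prompt("나는 어제 친구을 만났다", dummy_retrieve))
```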

Detecting Critical Errors Considering Cross-Cultural Factors in English-Korean Translation
Sugyeong Eo | Jungwoo Lim | Chanjun Park | DaHyun Jung | Seonmin Koo | Hyeonseok Moon | Jaehyung Seo | Heuiseok Lim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recent machine translation (MT) systems have overcome language barriers for a wide range of users, yet they still carry the risk of critical meaning deviation. Critical error detection (CED) is a task that identifies an inherent risk of catastrophic meaning distortion in machine translation output. Given the importance of reflecting cultural elements when detecting critical errors, we introduce the culture-aware “Politeness” type for detecting English-Korean critical translation errors. In addition, we facilitate two tasks by providing multiclass labels: critical error detection and critical error type classification (CETC). Empirical evaluations reveal that our data augmentation approach, which uses a newly presented perturber, significantly outperforms existing baselines on both tasks. Further analysis highlights the significance of multiclass labeling by demonstrating its superior effectiveness compared to binary labels.

2023

KEBAP: Korean Error Explainable Benchmark Dataset for ASR and Post-processing
Seonmin Koo | Chanjun Park | Jinsung Kim | Jaehyung Seo | Sugyeong Eo | Hyeonseok Moon | Heuiseok Lim
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Automatic Speech Recognition (ASR) systems are instrumental across various applications, and their performance is critically tied to user satisfaction. Conventional evaluation metrics for ASR systems produce a single aggregate score, which is insufficient for understanding specific system vulnerabilities. Therefore, we aim to address the limitations of previous ASR evaluation methods by introducing the Korean Error Explainable Benchmark Dataset for ASR and Post-processing (KEBAP). KEBAP enables comprehensive analysis of ASR systems at both the speech and text levels, thereby facilitating a more balanced assessment that encompasses speech recognition accuracy and user readability. KEBAP provides 37 newly defined speech-level resources incorporating diverse noise environment and speaker characteristic categories, and it also presents 13 distinct text-level error types. This paper presents detailed statistical analyses of colloquial noise categories and textual error types. Furthermore, we conduct extensive validation and analysis of commercially deployed ASR systems, providing valuable insights into their performance. As a more fine-grained and real-world-centric evaluation method, KEBAP contributes to identifying and mitigating potential weaknesses in ASR systems.
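A minimal illustration of the limitation mentioned above: a single aggregate word error rate (WER) hides which conditions an ASR system struggles with, whereas a per-category breakdown exposes them. The category names and transcripts below are invented for illustration; they are not KEBAP's categories or data.

```python
# Sketch: aggregate WER vs. a per-noise-category breakdown.
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard word-level edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

samples = [  # (noise category, reference transcript, ASR hypothesis)
    ("quiet", "please open the window", "please open the window"),
    ("street noise", "please open the window", "please open the winter"),
    ("street noise", "turn the music down", "turn the magic down"),
]

by_category = defaultdict(list)
for category, ref, hyp in samples:
    by_category[category].append(wer(ref, hyp))

print("aggregate WER:", round(sum(wer(r, h) for _, r, h in samples) / len(samples), 2))
for category, scores in by_category.items():
    print(f"{category}: {sum(scores) / len(scores):.2f}")
```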

2022

A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation
Jaehyung Seo | Seounghoon Lee | Chanjun Park | Yoonna Jang | Hyeonseok Moon | Sugyeong Eo | Seonmin Koo | Heuiseok Lim
Findings of the Association for Computational Linguistics: NAACL 2022

Recent natural language understanding (NLU) research on the Korean language has been maturing rapidly with the advancement of pretrained language models and datasets. However, Korean pretrained language models still struggle to generate a short sentence under a given condition based on compositionality and commonsense reasoning (i.e., generative commonsense reasoning). The two major challenges are the lack of data resources for developing generative commonsense reasoning that reflects Korean linguistic features and for evaluating language models, both of which are necessary for natural language generation (NLG). To solve these problems, we propose a text-generation dataset for Korean generative commonsense reasoning and language model evaluation. In this work, a semi-automatic dataset construction approach filters out content that cannot be explained by commonsense, ascertains quality, and reduces the cost of building the dataset. We also present an in-depth analysis of the generation results of language models using various evaluation metrics along with human-annotated scores. The whole dataset is publicly available at https://aihub.or.kr/opendata/korea-university.