Gary Geunbae Lee - ACL Anthology

This page is part of a temporary preview of a proposed change that may be incomplete or contain mistakes. It is not official and will be removed when the change is merged or abandoned.

Gary Geunbae Lee

Also published as: Geunbae Lee, Gary Lee, Gary Geunbae Lee

2026

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
Seonjeong Hwang | Jun Seo | Hyounghun Kim | Gary Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

Difficulty-Controllable Cloze Question Distractor Generation
Seokhoon Kang | Yejin Jeon | Seonjeong Hwang | Gary Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process to produce diverse and plausible distractors. These candidates are filtered and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is used to train a difficulty-controllable generation model via multitask learning. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.

Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Sangwon Ryu | Heejin Do | Yunsu Kim | Gary Lee | Jungseul Ok
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.

Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing
Jun Seo | Sangwon Ryu | Heejin Do | Hyounghun Kim | Gary Lee
Findings of the Association for Computational Linguistics: ACL 2026

Knowledge Tracing (KT) aims to predict learners’ future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item’s solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya’s framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim | Gary Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets—often collected from human or web sources—makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) Defensive Poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) Backdoor Neutralization, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.

Exploring Iterative Controllable Summarization with Large Language Models
Sangwon Ryu | Heejin Do | Daehui Kim | Hwanjo Yu | Dongwoo Kim | Yunsu Kim | Gary Lee | Jungseul Ok
Findings of the Association for Computational Linguistics: EACL 2026

Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. However, their ability to precisely control summary attributes (e.g., length or topic) remains underexplored, limiting their adaptability to specific user preferences. In this paper, we systematically explore the controllability of LLMs. To this end, we revisit summary attribute measurements and introduce iterative evaluation metrics, failure rate and average iteration count, to more precisely evaluate controllability beyond assessment of errors. Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes. To address this challenge, we propose a guide-to-explain framework (GTE) for controllable summarization. GTE enables the model to identify misaligned attributes in the initial draft and guides it to self-explain errors in the previous output. By encouraging reflection on attribute misalignment, GTE generates well-adjusted summaries that satisfy the desired attributes with robust effectiveness while requiring surprisingly fewer iterations than other iterative approaches.

Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition
Wonjun Lee | Hyounghun Kim | Gary Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.

Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Deokhyung Kang | Seonjeong Hwang | Daehui Kim | Hyounghun Kim | Gary Lee
Findings of the Association for Computational Linguistics: ACL 2026

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still exhibit a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have been made to address this gap, its underlying causes remain largely unexplored. In this work, we show that this gap primarily stems from failures in language understanding—specifically, the model’s inability to translate multilingual inputs into the language dominating its reasoning traces (typically English). As identifying understanding failures can enable targeted mitigation of the gap, we evaluate a range of detection methods and find that understanding failures are detectable to a meaningful extent, with supervised approaches performing best. Building on this, we propose Selective Translation, a strategy that incorporates an English translation into the initial reasoning trace when an understanding failure is detected. Experimental results using Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Together, our results show that failures in language understanding are the primary driver of the multilingual reasoning gap and can be detected and selectively mitigated, clarifying its origin and suggesting a path toward more equitable multilingual reasoning.

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
Seonjeong Hwang | Hyounghun Kim | Gary Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions—Evidence Scope and Transformation Level—that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs’ reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

2025

Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection
San Kim | Jonghwi Kim | Yejin Jeon | Gary Lee
Findings of the Association for Computational Linguistics: ACL 2025

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk; attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.

Self-Correcting Code Generation Using Small Language Models
Jeonghun Cho | Deokhyung Kang | Hyounghun Kim | Gary Lee
Findings of the Association for Computational Linguistics: EMNLP 2025

Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.

DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction
Solee Im | Wonjun Lee | JinMyeong An | Yunsu Kim | Jungseul Ok | Gary Lee
Findings of the Association for Computational Linguistics: ACL 2025

We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing.

Leveraging What’s Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
Taehee Park | Heejin Do | Gary Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.

Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning
Subin Kim | Hoonrae Kim | Heejin Do | Gary Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, their focus was primarily on text-based methods, often overlooking the importance of non-verbal evidence crucial in real-life therapy. To alleviate this gap, we extend the textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client’s facial expressions.To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs’ performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.

Multi-Facet Blending for Faceted Query-by-Example Retrieval
Heejin Do | Sangwon Ryu | Jonghwi Kim | Gary Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the growing demand to fit fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has gained recent attention. However, prior approaches mainly depend on document-level comparisons using basic indicators like citations due to the lack of facet-level relevance datasets; yet, this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs’ intrinsic distinguishing capabilities; then, dynamically recomposing the units leads to facet-wise relevance-informed document pairs. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to prove the FaBle’s efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.

Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation
Deokhyung Kang | Jeonghun Cho | Yejin Jeon | Sunbin Jang | Minsub Lee | Jawoon Cho | Gary Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visual programming languages (VPLs) allow users to create programs through graphical interfaces, which results in easier accessibility and their widespread usage in various domains. To further enhance this accessibility, recent research has focused on generating VPL code from user instructions using large language models (LLMs). Specifically, by employing prompting-based methods, these studies have shown promising results. Nevertheless, such approaches can be less effective for industrial VPLs such as Ladder Diagram (LD). LD is a pivotal language used in industrial automation processes and involves extensive domain-specific configurations, which are difficult to capture in a single prompt. In this work, we demonstrate that training-based methods outperform prompting-based methods for LD generation accuracy, even with smaller backbone models. Building on these findings, we propose a two-stage training strategy to further enhance VPL generation. First, we employ retrieval-augmented fine-tuning to leverage the repetitive use of subroutines commonly seen in industrial VPLs. Second, we apply direct preference optimization (DPO) to further guide the model toward accurate outputs, using systematically generated preference pairs through graph editing operations. Extensive experiments on real-world LD data demonstrate that our approach improves program-level accuracy by over 10% compared to supervised fine-tuning, which highlights its potential to advance industrial automation.

K-COMP: Retrieval-Augmented Medical Domain Question Answering With Knowledge-Injected Compressor
Jeonghun Cho | Gary Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis
Yejin Jeon | Youngjae Kim | Jihyun Lee | Hyounghun Kim | Gary Lee
Findings of the Association for Computational Linguistics: EMNLP 2025

For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. While text-to-speech (TTS) can be used as a communication aid since it generates synthetic speech, it fails to preserve the user’s own voice. As such, face-to-voice (FTV) synthesis, which derives corresponding voices from facial images, provides a promising alternative. However, existing methods rely on pre-trained visual encoders, and finetune them to align with speech embeddings, which strips fine-grained information from facial inputs such as gender or ethnicity, despite their known correlation with vocal traits. Moreover, these pipelines are multi-stage, which requires separate training of multiple components, thus leading to training inefficiency. To address these limitations, we utilize fine-grained facial attribute modeling by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granular representation. This representation is further refined through multi-task learning of speaker attributes such as gender and ethnicity at both the visual and acoustic domains. Moreover, to improve alignment robustness, we adopt a multi-view training strategy by pairing various visual perspectives of a speaker in terms of different angles and lighting conditions, with identical speech recordings. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.

KoBLEX: Open Legal Question Answering with Multi-hop Reasoning
Jihyung Lee | Daehui Kim | Seonjeong Hwang | Hyounghun Kim | Gary Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLM) have achieved remarkable performances in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs’ legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM–human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-Eval). LF-Eval is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves +37.91 higher F1 and +30.81 higher LF-Eval. Further analyses reveal that ParSeR efficiently delivers consistent performance across reasoning depths, with ablations confirming the effectiveness of ParSeR.

GuRE:Generative Query REwriter for Legal Passage Retrieval
Daehui Kim | Deokhyung Kang | Jonghwi Kim | Sangwon Ryu | Gary Lee
Proceedings of the Natural Legal Language Processing Workshop 2025

Legal Passage Retrieval (LPR) systems are crucial as they help practitioners save time when drafting legal arguments. However, it remains an underexplored avenue. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the Generative query REwriter (GuRE). We leverage the generative capabilities of Large Language Models (LLMs) by training the LLM for query rewriting. "Rewritten queries" help retrievers to retrieve target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Codes are avaiable at github.com/daehuikim/GuRE.

DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition
Wonjun Lee | Solee Im | Heejin Do | Yunsu Kim | Jungseul Ok | Gary Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Dysarthric speech recognition often suffers from performance degradation due to the intrinsic diversity of dysarthric severity and extrinsic disparity from normal speech. To bridge these gaps, we propose a Dynamic Phoneme-level Contrastive Learning (DyPCL) method, which leads to obtaining invariant representations across diverse speakers. We decompose the speech utterance into phoneme segments for phoneme-level contrastive learning, leveraging dynamic connectionist temporal classification alignment. Unlike prior studies focusing on utterance-level embeddings, our granular learning allows discrimination of subtle parts of speech. In addition, we introduce dynamic curriculum learning, which progressively transitions from easy negative samples to difficult-to-distinguishable negative samples based on phonetic similarity of phoneme. Our approach to training by difficulty levels alleviates the inherent variability of speakers, better identifying challenging speeches. Evaluated on the UASpeech dataset, DyPCL outperforms baseline models, achieving an average 22.10% relative reduction in word error rate (WER) across the overall dysarthria group.

Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking
Jihyun Lee | Solee Im | Wonjun Lee | Gary Lee
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST model. Our novel method can control the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, our method generated sufficient error patterns on keywords, leading to improved accuracy in noised and low-accuracy ASR environments.

PanicToCalm: A Proactive Counseling Agent for Panic Attacks
Jihyun Lee | Yejin Min | San Kim | Yejin Jeon | Sung Jun Yang | Hyounghun Kim | Gary Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Panic attacks are acute episodes of fear and distress, in which timely, appropriate intervention can significantly help individuals regain stability. However, suitable datasets for training such models remain scarce due to ethical and logistical issues. To address this, we introduce Pace, which is a dataset that includes high-distress episodes constructed from first-person narratives, and structured around the principles of Psychological First Aid (PFA). Using this data, we train Pacer, a counseling model designed to provide both empathetic and directive support, which is optimized through supervised learning and simulated preference alignment. To assess its effectiveness, we propose PanicEval, a multi-dimensional framework covering general counseling quality and crisis-specific strategies. Experimental results show that Pacer outperforms strong baselines in both counselor-side metrics and client affect improvement. Human evaluations further confirm its practical value, with Pacer consistently preferred over general, CBT-based, and GPT-4-powered models in panic scenarios.

PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
Jihyun Lee | Yejin Jeon | Seungyeon Seo | Gary Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users’ personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains.

MIRROR: Multimodal Cognitive Reframing Therapy for Rolling with Resistance
Subin Kim | Hoonrae Kim | Jihyun Lee | Yejin Jeon | Gary Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, which allows the AI therapist to better align its responses with the client’s negative emotional state.Specifically, we introduce a new synthetic dataset, Mirror (Multimodal Interactive Rolling with Resistance), which is a novel synthetic dataset that pairs each client’s statements with corresponding facial images. Using this dataset, we train baseline vision language models (VLMs) so that they can analyze facial cues, infer emotions, and generate empathetic responses to effectively manage client resistance.These models are then evaluated in terms of both their counseling skills as a therapist, and the strength of therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist’s ability to handle resistance, which outperforms existing text-based CBT approaches.Human expert evaluations further confirm the effectiveness of our approach in managing client resistance and fostering therapeutic alliance.

Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring
Heejin Do | Taehee Park | Sangwon Ryu | Gary Lee
Findings of the Association for Computational Linguistics: NAACL 2025

In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs of specific prompts pose challenges in obtaining prompt-generalized essay representation. In this work, we propose a grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representation. We acquire grammatical error-corrected information in essays via the grammar error correction technique and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method’s generalizability, showing remarkable improvements in prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating unseen prompts.

The Limits of Post-hoc Preference Adaptation: A Case Study on DSTC12 Clustering
Jihyun Lee | Gary Lee
Proceedings of the Twelfth Dialog System Technology Challenge

Understanding user intent in dialogue is essential for controllable and coherent conversational AI. In this work, we present a case study on controllable theme induction in dialogue systems using the DSTC12 Track 2 dataset. Our pipeline integrates LLM-based summarization, utterance clustering, and synthetic preference modeling based on should-link and cannot-link predictions. While preference signals offer moderate improvements in cluster refinement, we observe that their effectiveness is significantly constrained by coarse initial clustering. Experiments on the Finance and Insurance domains show that even authentic human labeled preference struggle when initial clusters do not align with human intent. These findings highlight the need to incorporate preference supervision earlier in the pipeline to ensure semantically coherent clustering.

Revisiting Early Detection of Sexual Predators via Turn-level Optimization
JinMyeong An | Sangwon Ryu | Heejin Do | Yunsu Kim | Jungseul Ok | Gary Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Online grooming is a severe social threat where sexual predators gradually entrap child victims with subtle and gradual manipulation. Therefore, timely intervention for online grooming is critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., jump to conclusions) as they rely on chat-level risk labels by causing weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator’s turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in previously used chat-level metrics. Experimental results show that SCoRL effectively preempted online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.

Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech
Yejin Jeon | Youngjae Kim | Jihyun Lee | Gary Lee
Findings of the Association for Computational Linguistics: NAACL 2025

Emotional dialogue speech synthesis (EDSS) aims to generate expressive speech by leveraging the dialogue context between interlocutors. This is typically done by concatenating global representations of previous utterances as conditions for text-to-speech (TTS) systems. However, such approaches overlook the importance of integrating localized acoustic cues that convey emotion. To address this, we introduce a novel approach that utilizes a large language model (LLM) to generate holistic emotion tags based on prior dialogue context, while also pinpointing key words in the target utterance that align with the predicted emotional state. Furthermore, we enhance the emotional richness of synthesized speech by incorporating concentrated acoustic features of these key words through a novel selective audio masking loss function. This methodology not only improves emotional expressiveness, but also facilitates automatic emotion speech generation during inference by eliminating the need for manual emotion tag selection. Comprehensive subjective and objective evaluations and analyses demonstrate the effectiveness of the proposed approach.

MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries
Jonghwi Kim | Deokhyung Kang | Seonjeong Hwang | Yunsu Kim | Jungseul Ok | Gary Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce ***MiLQ***, ***Mi***xed-***L***anguage ***Q***uery test set, the first public benchmark of mixed-language queries, qualified as realistic and relatively preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, also suggesting code-switched training data’s potential for robust IR models handling such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.

EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance
Heejae Suh | Yejin Jeon | Deokhyung Kang | Taehee Park | Yejin Min | Gary Lee
Findings of the Association for Computational Linguistics: ACL 2025

Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.

2024

Autoregressive Score Generation for Multi-trait Essay Scoring
Heejin Do | Yunsu Kim | Gary Lee
Findings of the Association for Computational Linguistics: EACL 2024

Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score. However, studies have yet to explore these models in multi-trait AES, possibly due to the inefficiency of replicating BERT-based models for each trait. Breaking away from the existing sole use of *encoder*, we propose an autoregressive prediction of multi-trait scores (ArTS), incorporating a *decoding* process by leveraging the pre-trained T5. Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores. During decoding, the subsequent trait prediction can benefit by conditioning on the preceding trait scores. Experimental results proved the efficacy of ArTS, showing over 5% average improvements in both prompts and traits.

Multi-Level Attention Aggregation for Language-Agnostic Speaker Replication
Yejin Jeon | Gary Lee
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper explores the task of language-agnostic speaker replication, a novel endeavor that seeks to replicate a speaker’s voice irrespective of the language they are speaking. Towards this end, we introduce a multi-level attention aggregation approach that systematically probes and amplifies various speaker-specific attributes in a hierarchical manner. Through rigorous evaluations across a wide range of scenarios including seen and unseen speakers conversing in seen and unseen lingua, we establish that our proposed model is able to achieve substantial speaker similarity, and is able to generalize to out-of-domain (OOD) cases.

DiagESC: Dialogue Synthesis for Integrating Depression Diagnosis into Emotional Support Conversation
Seungyeon Seo | Gary Geunbae Lee
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Dialogue systems for mental health care aim to provide appropriate support to individuals experiencing mental distress. While extensive research has been conducted to deliver adequate emotional support, existing studies cannot identify individuals who require professional medical intervention and cannot offer suitable guidance. We introduce the Diagnostic Emotional Support Conversation task for an advanced mental health management system. We develop the DESC dataset to assess depression symptoms while maintaining user experience by utilizing task-specific utterance generation prompts and a strict filtering algorithm. Evaluations by professional psychological counselors indicate that DESC has a superior ability to diagnose depression than existing data. Additionally, conversational quality evaluation reveals that DESC maintains fluent, consistent, and coherent dialogues.

Multi-Dimensional Optimization for Text Summarization via Reinforcement Learning
Sangwon Ryu | Heejin Do | Yunsu Kim | Gary Lee | Jungseul Ok
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The evaluation of summary quality encompasses diverse dimensions such as consistency, coherence, relevance, and fluency. However, existing summarization methods often target a specific dimension, facing challenges in generating well-balanced summaries across multiple dimensions. In this paper, we propose multi-objective reinforcement learning tailored to generate balanced summaries across all four dimensions. We introduce two multi-dimensional optimization (MDO) strategies for adaptive learning: 1) MDO_min, rewarding the current lowest dimension score, and 2) MDO_pro, optimizing multiple dimensions similar to multi-task learning, resolves conflicting gradients across dimensions through gradient projection. Unlike prior ROUGE-based rewards relying on reference summaries, we use a QA-based reward model that aligns with human preferences. Further, we discover the capability to regulate the length of summaries by adjusting the discount factor, seeking the generation of concise yet informative summaries that encapsulate crucial points. Our approach achieved substantial performance gains compared to baseline models on representative summarization datasets, particularly in the overlooked dimensions.

Leveraging the Interplay between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation
Yejin Jeon | Yunsu Kim | Gary Geunbae Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech. Nevertheless, it is important to note that these achievements have predominantly been verified within the context of high-resource languages such as English. Furthermore, the Tacotron and FastSpeech variants show substantial pausing errors when applied to the Korean language, which affects speech perception and naturalness. In order to address the aforementioned issues, we propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns. Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences, despite its training on short audio clips. Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, thus confirming model performance.

Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages
Seonjeong Hwang | Yunsu Kim | Gary Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Automatic question generation (QG) serves a wide range of purposes, such as augmenting question-answering (QA) corpora, enhancing chatbot systems, and developing educational materials. Despite its importance, most existing datasets predominantly focus on English, resulting in a considerable gap in data availability for other languages. Cross-lingual transfer for QG (XLT-QG) addresses this limitation by allowing models trained on high-resource language datasets to generate questions in low-resource languages. In this paper, we propose a simple and efficient XLT-QG method that operates without the need for monolingual, parallel, or labeled data in the target language, utilizing a small language model. Our model, trained solely on English QA datasets, learns interrogative structures from a limited set of question exemplars, which are then applied to generate questions in the target language. Experimental results show that our method outperforms several XLT-QG baselines and achieves performance comparable to GPT-3.5-turbo across different languages. Additionally, the synthetic data generated by our model proves beneficial for training multilingual QA models. With significantly fewer parameters than large language models and without requiring additional training for target languages, our approach offers an effective solution for QG and QA tasks across various languages.

An Investigation into Explainable Audio Hate Speech Detection
Jinmyeong An | Wonjun Lee | Yejin Jeon | Jungseul Ok | Yunsu Kim | Gary Geunbae Lee
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Research on hate speech has predominantly revolved around the detection and interpretation from textual inputs, leaving verbal content largely unexplored. Moreover, while there has been some limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. As such, we introduce a new task within the audio hate speech detection task domain - we specifically aim to identify specific time frames of hate speech within audio utterances. Towards this, we propose two different approaches, cascading and End-to-End (E2E). The first cascading approach initially converts audio to transcripts, identifies hate speech within these transcripts, and subsequently locates the corresponding audio time frames. Conversely, the second E2E approach processes audio utterances directly, which allows it to pinpoint hate speech within specific time frames. Moreover, due to the lack of explainable audio hate speech datasets that include frame-level rationales, we curated a synthetic audio dataset to train our models. We further validate these models on actual human speech utterances and we find that the E2E approach outperforms the cascading method in terms of audio frame Intersection over Union (IoU) metric. Furthermore, we observe that the inclusion of frame-level rationales significantly enhances hate speech detection accuracy for both E2E and cascading approaches.

Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech
Youngjae Kim | Yejin Jeon | Gary Lee
Findings of the Association for Computational Linguistics: EMNLP 2024

The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource scenarios. Moreover, current literature rely on fixed expressions from language IDs, which results in the inadequate learning of language representations, and the failure to generate speech in unseen languages. To address these challenges, we propose a novel method that directly extracts linguistic features from audio input while effectively filtering out miscellaneous acoustic information including speaker-specific attributes like timbre. Subjective and objective evaluations affirm the effectiveness of our approach for multi-lingual text-to-speech, and highlight its superiority in low-resource transfer learning for previously unseen language.

Denoising Table-Text Retrieval for Open-Domain Question Answering
Deokhyung Kang | Baikjin Jung | Yunsu Kim | Gary Geunbae Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.

Explainable Multi-hop Question Generation: An End-to-End Approach without Intermediate Question Labeling
Seonjeong Hwang | Yunsu Kim | Gary Geunbae Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In response to the increasing use of interactive artificial intelligence, the demand for the capacity to handle complex questions has increased. Multi-hop question generation aims to generate complex questions that requires multi-step reasoning over several documents. Previous studies have predominantly utilized end-to-end models, wherein questions are decoded based on the representation of context documents. However, these approaches lack the ability to explain the reasoning process behind the generated multi-hop questions. Additionally, the question rewriting approach, which incrementally increases the question complexity, also has limitations due to the requirement of labeling data for intermediate-stage questions. In this paper, we introduce an end-to-end question rewriting model that increases question complexity through sequential rewriting. The proposed model has the advantage of training with only the final multi-hop questions, without intermediate questions. Experimental results demonstrate the effectiveness of our model in generating complex questions, particularly 3- and 4-hop questions, which are appropriately paired with input answers. We also prove that our model logically and incrementally increases the complexity of questions, and the generated multi-hop questions are also beneficial for training question answering models.

Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning
Wonjun Lee | San Kim | Gary Geunbae Lee
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Recent dialogue systems typically operate through turn-based spoken interactions between users and agents. These systems heavily depend on accurate Automatic Speech Recognition (ASR), as transcription errors can significantly degrade performance in downstream dialogue tasks. To alleviate this challenge, robust ASR is required, and one effective method is to utilize the dialogue context from user and agent interactions for transcribing the subsequent user utterance. This method incorporates the transcription of the user’s speech and the agent’s response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because the ASR model generates it auto-regressively. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce context noise representation learning to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach involves decoder pre-training with text-based dialogue data and noise representation learning for a context encoder. Evaluated on DSTC11 (MultiWoZ 2.1 audio dialogues), it achieves a 24% relative reduction in Word Error Rate (WER) compared to wav2vec2.0 baselines and a 13% reduction compared to Whisper-large-v2. Notably, in noisy environments where user speech is barely audible, our method proves its effectiveness by utilizing contextual information for accurate transcription. Tested on audio data with strong noise level (Signal Noise Ratio of 0dB), our approach shows up to a 31% relative WER reduction compared to the wav2vec2.0 baseline, providing a reassuring solution for real-world noisy scenarios.

Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing
Deokhyung Kang | Seonjeong Hwang | Yunsu Kim | Gary Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent efforts have aimed to utilize multilingual pretrained language models (mPLMs) to extend semantic parsing (SP) across multiple languages without requiring extensive annotations. However, achieving zero-shot cross-lingual transfer for SP remains challenging, leading to a performance gap between source and target languages. In this study, we propose Cross-Lingual Back-Parsing (CBP), a novel data augmentation methodology designed to enhance cross-lingual transfer for SP. Leveraging the representation geometry of the mPLMs, CBP synthesizes target language utterances from source meaning representations. Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings, by utilizing only labeled data in the source language and monolingual corpora. Extensive experiments on two cross-language SP benchmarks (Mschema2QA and Xspider) demonstrate that CBP brings substantial gains in the target language. Further analysis of the synthesized utterances shows that our method successfully generates target language utterances with high slot value alignment rates while preserving semantic integrity. Our codes and data are publicly available at https://github.com/deokhk/CBP.

Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards
Heejin Do | Sangwon Ryu | Gary Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an autoregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.

Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim | Gary Lee
Findings of the Association for Computational Linguistics: NAACL 2024

Recent advancements in open-domain dialogue systems have been propelled by the emergence of high-quality large language models (LLMs) and various effective training methodologies. Nevertheless, the presence of toxicity within these models presents a significant challenge that can potentially diminish the user experience. In this study, we introduce an innovative training algorithm, an improvement upon direct preference optimization (DPO), called adversarial DPO (ADPO). The ADPO algorithm is designed to train models to assign higher probability distributions to preferred responses and lower distributions to unsafe responses, which are self-generated using the toxic control token. We demonstrate that ADPO enhances the model’s resilience against harmful conversations while minimizing performance degradation. Furthermore, we illustrate that ADPO offers a more stable training procedure compared to the traditional DPO. To the best of our knowledge, this is the first adaptation of the DPO algorithm that directly incorporates harmful data into the generative model, thereby reducing the need to artificially create safe dialogue data.

2023

Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring
Heejin Do | Yunsu Kim | Gary Geunbae Lee
Findings of the Association for Computational Linguistics: ACL 2023

Automated essay scoring (AES) aims to score essays written for a given prompt, which defines the writing topic. Most existing AES systems assume to grade essays of the same prompt as used in training and assign only a holistic score. However, such settings conflict with real-education situations; pre-graded essays for a particular prompt are lacking, and detailed trait scores of sub-rubrics are required. Thus, predicting various trait scores of unseen-prompt essays (called cross-prompt essay trait scoring) is a remaining challenge of AES. In this paper, we propose a robust model: prompt- and trait relation-aware cross-prompt essay trait scorer. We encode prompt-aware essay representation by essay-prompt attention and utilizing the topic-coherence feature extracted by the topic-modeling mechanism without access to labeled data; therefore, our model considers the prompt adherence of an essay, even in a cross-prompt setting. To facilitate multi-trait scoring, we design trait-similarity loss that encapsulates the correlations of traits. Experiments prove the efficacy of our model, showing state-of-the-art results for all prompts and traits. Significant improvements in low-resource-prompt and inferior traits further indicate our model’s strength.

DORIC : Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing
Jihyun Lee | Seungyeon Seo | Yunsu Kim | Gary Geunbae Lee
Proceedings of the Eleventh Dialog System Technology Challenge

We present our work on Track 2 in the Dialog System Technology Challenges 11 (DSTC11). DSTC11-Track2 aims to provide a benchmark for zero-shot, cross-domain, intent-set induction. In the absence of in-domain training dataset, robust utterance representation that can be used across domains is necessary to induce users’ intentions. To achieve this, we leveraged a multi-domain dialogue dataset to fine-tune the language model and proposed extracting Verb-Object pairs to remove the artifacts of unnecessary information. Furthermore, we devised the method that generates each cluster’s name for the explainability of clustered results. Our approach achieved 3rd place in the precision score and showed superior accuracy and normalized mutual information (NMI) score than the baseline model on various domain datasets.

Exploring Back Translation with Typo Noise for Enhanced Inquiry Understanding in Task-Oriented Dialogue
Jihyun Lee | Junseok Kim | Gary Geunbae Lee
Proceedings of the Eleventh Dialog System Technology Challenge

This paper presents our approach to the DSTC11 Track 5 selection task, which focuses on retrieving appropriate natural language knowledge sources for task-oriented dialogue. We propose typologically diverse back-translation method with typo noise, which could generate various structured user inquries. Through our noised back translation, we augmented inquiries by combining three different typologies of language sources with five different typo noise injections. Our experiments demonstrate that typological variety and typo noise aids the model in generalizing to diverse user inquiries in dialogue. In the competition, where 14 teams participated, our approach achieved the 5th rank for exact matching metric.

2022

Multi-Type Conversational Question-Answer Generation with Closed-ended and Unanswerable Questions
Seonjeong Hwang | Yunsu Kim | Gary Geunbae Lee
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Conversational question answering (CQA) facilitates an incremental and interactive understanding of a given context, but building a CQA system is difficult for many domains due to the problem of data scarcity. In this paper, we introduce a novel method to synthesize data for CQA with various question types, including open-ended, closed-ended, and unanswerable questions. We design a different generation flow for each question type and effectively combine them in a single, shared framework. Moreover, we devise a hierarchical answerability classification (hierarchical AC) module that improves quality of the synthetic data while acquiring unanswerable questions. Manual inspections show that synthetic data generated with our framework have characteristics very similar to those of human-generated conversations. Across four domains, CQA systems trained on our synthetic data indeed show good performance close to the systems trained on human-annotated data.

Schema Encoding for Transferable Dialogue State Tracking
Hyunmin Jeon | Gary Geunbae Lee
Proceedings of the 29th International Conference on Computational Linguistics

Dialogue state tracking (DST) is an essential sub-task for task-oriented dialogue systems. Recent work has focused on deep neural models for DST. However, the neural models require a large dataset for training. Furthermore, applying them to another domain needs a new dataset because the neural models are generally trained to imitate the given dataset. In this paper, we propose Schema Encoding for Transferable Dialogue State Tracking (SET-DST), which is a neural DST method for effective transfer to new domains. Transferable DST could assist developments of dialogue systems even with few dataset on target domains. We use a schema encoder not just to imitate the dataset but to comprehend the schema of the dataset. We aim to transfer the model to new domains by encoding new schemas and using them for DST on multi-domain settings. As a result, SET-DST improved the joint accuracy by 1.46 points on MultiWOZ 2.1.

Conversational QA Dataset Generation with Answer Revision
Seonjeong Hwang | Gary Geunbae Lee
Proceedings of the 29th International Conference on Computational Linguistics

Conversational question-answer generation is a task that automatically generates a large-scale conversational question answering dataset based on input passages. In this paper, we introduce a novel framework that extracts question-worthy phrases from a passage and then generates corresponding questions considering previous conversations. In particular, our framework revises the extracted answers after generating questions so that answers exactly match paired questions. Experimental results show that our simple answer revision approach leads to significant improvement in the quality of synthetic data. Moreover, we prove that our framework can be effectively utilized for domain adaptation of conversational question answering.

2018

Out-of-domain Detection based on Generative Adversarial Network
Seonghan Ryu | Sangjun Koo | Hwanjo Yu | Gary Geunbae Lee
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The main goal of this paper is to develop out-of-domain (OOD) detection for dialog systems. We propose to use only in-domain (IND) sentences to build a generative adversarial network (GAN) of which the discriminator generates low scores for OOD sentences. To improve basic GANs, we apply feature matching loss in the discriminator, use domain-category analysis as an additional task in the discriminator, and remove the biases in the generator. Thereby, we reduce the huge effort of collecting OOD sentences for training OOD detection. For evaluation, we experimented OOD detection on a multi-domain dialog system. The experimental results showed the proposed method was most accurate compared to the existing methods.

2015

Conversational Knowledge Teaching Agent that uses a Knowledge Base
Kyusong Lee | Paul Hongsuck Seo | Junhwi Choi | Sangjun Koo | Gary Geunbae Lee
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Question Answering System using Multiple Information Source and Open Type Answer Merge
Seonyeong Park | Soonchoul Kwon | Byungsoo Kim | Sangdo Han | Hyosup Shim | Gary Geunbae Lee
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

Exploiting knowledge base to generate responses for natural language dialog listening agents
Sangdo Han | Jeesoo Bang | Seonghan Ryu | Gary Geunbae Lee
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

POSTECH Grammatical Error Correction System in the CoNLL-2014 Shared Task
Kyusong Lee | Gary Geunbae Lee
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

2013

Counseling Dialog System with 5W1H Extraction
Sangdo Han | Kyusong Lee | Donghyeon Lee | Gary Geunbae Lee
Proceedings of the SIGDIAL 2013 Conference

2012

A Hierarchical Domain Model-Based Multi-Domain Selection Framework for Multi-Domain Dialog Systems
Seonghan Ryu | Donghyeon Lee | Injae Lee | Sangdo Han | Gary Geunbae Lee | Myungjae Kim | Kyungduk Kim
Proceedings of COLING 2012: Posters

A Meta Learning Approach to Grammatical Error Correction
Hongsuck Seo | Jonghoon Lee | Seokhwan Kim | Kyusong Lee | Sechun Kang | Gary Geunbae Lee
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Haizhou Li | Chin-Yew Lin | Miles Osborne | Gary Geunbae Lee | Jong C. Park
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Haizhou Li | Chin-Yew Lin | Miles Osborne | Gary Geunbae Lee | Jong C. Park
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
Seokhwan Kim | Gary Geunbae Lee
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Gary Geunbae Lee | Jonathan Ginzburg | Claire Gardent | Amanda Stent
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Grammatical Error Annotation for Korean Learners of Spoken English
Hongsuck Seo | Kyusong Lee | Gary Geunbae Lee | Soo-Ok Kweon | Hae-Ri Kim
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The goal of our research is to build a grammatical error-tagged corpus for Korean learners of Spoken English dubbed Postech Learner Corpus. We collected raw story-telling speech from Korean university students. Transcription and annotation using the Cambridge Learner Corpus tagset were performed by six Korean annotators fluent in English. For the annotation of the corpus, we developed an annotation tool and a validation tool. After comparing human annotation with machine-recommended error tags, unmatched errors were rechecked by a native annotator. We observed different characteristics between the spoken language corpus built in this study and an existing written language corpus.

2011

A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction
Seokhwan Kim | Minwoo Jeong | Jonghoon Lee | Gary Geunbae Lee
Proceedings of 5th International Joint Conference on Natural Language Processing

POMY: A Conversational Virtual Environment for Language Learning in POSTECH
Hyungjong Noh | Kyusong Lee | Sungjin Lee | Gary Geunbae Lee
Proceedings of the SIGDIAL 2011 Conference

2010

A Cross-lingual Annotation Projection Approach for Relation Detection
Seokhwan Kim | Minwoo Jeong | Jonghoon Lee | Gary Geunbae Lee
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

Automatic Agenda Graph Construction from Human-Human Dialogs using Clustering Method
Cheongjae Lee | Sangkeun Jung | Kyungduk Kim | Gary Geunbae Lee
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Realistic Grammar Error Simulation using Markov Logic
Sungjin Lee | Gary Geunbae Lee
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Semi-supervised Speech Act Recognition in Emails and Forums
Minwoo Jeong | Chin-Yew Lin | Gary Geunbae Lee
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Efficient Inference of CRFs for Large-Scale Natural Language Data
Minwoo Jeong | Chin-Yew Lin | Gary Geunbae Lee
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Proceedings of the ACL-IJCNLP 2009 Software Demonstrations
Gary Geunbae Lee | Sabine Schulte im Walde
Proceedings of the ACL-IJCNLP 2009 Software Demonstrations

A Local Tree Alignment-based Soft Pattern Matching Approach for Information Extraction
Seokhwan Kim | Minwoo Jeong | Gary Geunbae Lee
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Hybrid Approach to User Intention Modeling for Dialog Simulation
Sangkeun Jung | Cheongjae Lee | Kyungduk Kim | Gary Geunbae Lee
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

POSTECH machine translation system for IWSLT 2008 evaluation campaign.
Jonghoon Lee | Gary Geunbae Lee
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we describe POSTECH system for IWSLT 2008 evaluation campaign. The system is based on phrase based statistical machine translation. We set up a baseline system using well known freely available software. A preprocessing method and a language modeling method have been applied to the baseline system in order to improve machine translation quality. The preprocessing method is to identify and remove useless tokens in source texts. And the language modeling method models phrase level n-gram. We have participated in the BTEC tasks to see the effects of our methods.

Transformation-based Sentence Splitting method for Statistical Machine Translation
Jonghoon Lee | Donghyeon Lee | Gary Geunbae Lee
Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST)

Robust Dialog Management with N-Best Hypotheses Using Dialog Examples and Agenda
Cheongjae Lee | Sangkeun Jung | Gary Geunbae Lee
Proceedings of ACL-08: HLT

A Frame-Based Probabilistic Framework for Spoken Dialog Management Using Dialog Examples
Kyungduk Kim | Cheongjae Lee | Sangkeun Jung | Gary Geunbae Lee
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

An Integrated Dialog Simulation Technique for Evaluating Spoken Dialog Systems
Sangkeun Jung | Cheongjae Lee | Kyungduk Kim | Gary Geunbae Lee
Coling 2008: Proceedings of the workshop on Speech Processing for Safety Critical Translation and Pervasive Applications

2007

POSSLT: A Korean to English Spoken Language Translation System
Donghyeon Lee | Jonghoon Lee | Gary Geunbae Lee
Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT)

A Joint Statistical Model for Simultaneous Word Spacing and Spelling Error Correction for Korean
Hyungjong Noh | Jeong-Won Cha | Gary Geunbae Lee
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

Exploiting Non-Local Features for Spoken Language Understanding
Minwoo Jeong | Gary Geunbae Lee
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

MMR-based Active Machine Learning for Bio Named Entity Recognition
Seokhwan Kim | Yu Song | Kyungduk Kim | Jeong-Won Cha | Gary Geunbae Lee
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

2005

Heuristic Methods for Reducing Errors of Geographic Named Entities Learned by Bootstrapping
Seungwoo Lee | Gary Geunbae Lee
Second International Joint Conference on Natural Language Processing: Full Papers

POSBIOTM/W: A Development Workbench for Machine Learning Oriented Biomedical Text Mining System
Kyungduk Kim | Yu Song | Gary Geunbae Lee
Proceedings of HLT/EMNLP 2005 Interactive Demonstrations

2004

Using Higher-level Linguistic Knowledge for Speech Recognition Error Correction in a Spoken Q/A Dialog
Minwoo Jeong | Byeongchang Kim | Gary Geunbae Lee
Proceedings of the HLT-NAACL 2004 Workshop on Spoken Language Understanding for Conversational Systems and Higher Level Linguistic Information for Speech Processing

POSBIOTM-NER in the Shared Task of BioNLP/NLPBA2004
Yu Song | Eunju Kim | Gary Geunbae Lee | Byoung-kee Yi
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)

MMR-based Feature Selection for Text Categorization
Changki Lee | Gary Geunbae Lee
Proceedings of HLT-NAACL 2004: Short Papers

2003

Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web
Joohui An | Seungwoo Lee | Gary Geunbae Lee
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics

2002

Multilingual Question Answering with High Portability on Relational Databases
Hanmin Jung | Gary Geunbae Lee
COLING-02: Multilingual Summarization and Question Answering

Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of Korean
Gary Geunbae Lee | Jeongwon Cha | Jong-Hyeok Lee
Computational Linguistics, Volume 28, Number 1, March 2002

2001

Automatic Corpus-based Tone Prediction using K-ToBI Representation
Jin-Seok Lee | Byeongchang Kim | Gary Geunbae Lee
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

MAYA: A Fast Question-answering System Based on a Predictive Answer Indexer
Harksoo Kim | Kyungsun Kim | Gary Geunbae Lee | Jungyun Seo
Proceedings of the ACL 2001 Workshop on Open-Domain Question Answering

2000

Decision-Tree based Error Correction for Statistical Phrase Break Prediction in Korean
Byeongchang Kim | Geunbae Lee
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

POSCAT: A Morpheme-based Speech Corpus Annotation Tool
Byeongchang Kim | Jin-seok Lee | Jeongwon Cha | Geunbae Lee
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Structural disambiguation of morpho-syntactic categorial parsing for Korean
Jeongwon Cha | Geunbae Lee
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Corpus-Based Learning of Compound Noun Indexing
Byung-Kwan Kwak | Jee-Hyub Kim | Geunbae Lee | Jung Yun Seo
ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval

1998

Unlimited Vocabulary Grapheme to Phoneme Conversion for Korean TTS
Byeongchang Kim | WonIl Lee | Geunbae Lee | Jong-Hyeok Lee
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Generalized unknown morpheme guessing for hybrid POS tagging of Korean
Jeongwon Cha | Geunbae Lee | Jong-Hyeok Lee
Sixth Workshop on Very Large Corpora

Identifying Syntactic Role of Antecedent in Korean Relative Clause Using Corpus and Thesaurus Information
Hui-Feng Li | Jong-Hyeok Lee | Geunbae Lee
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

Unlimited Vocabulary Grapheme to Phoneme Conversion for Korean TTS
Byeongchang Kim | WonIl Lee | Geunbae Lee | Jong-Hyeok Lee
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Identifying Syntactic Role of Antecedent in Korean Relative Clause using Corpus and Thesaurus Informationes
Hui-Feng Li | Jong-Hyeok Lee | Geunbae Lee
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

1994

Table-driven Neural Syntactic Analysis of Spoken Korean
Wonll Lee | Geunbae Lee | Jong-Hyeok Lee
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

Co-authors

Deokhyung Kang 8

Jong-Hyeok Lee 7

Jeong-Won Cha 6

Byeongchang Kim 6

Sangkeun Jung 5

Cheongjae Lee 5

Donghyeon Lee 4

Seungyeon Seo 3

Hyungjong Noh 2

Miles Osborne 2

Claire Gardent 1

Jonathan Ginzburg 1

Seokhoon Kang 1

Byung-Kwan Kwak 1

Soonchoul Kwon 1

Seonyeong Park 1

Sabine Schulte im Walde 1

Sung Jun Yang 1

Byoung-kee Yi 1

Venues