Jihyun Lee

2025

pdf bib abs
Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech
Yejin Jeon | Youngjae Kim | Jihyun Lee | Gary Lee
Findings of the Association for Computational Linguistics: NAACL 2025

Emotional dialogue speech synthesis (EDSS) aims to generate expressive speech by leveraging the dialogue context between interlocutors. This is typically done by concatenating global representations of previous utterances as conditions for text-to-speech (TTS) systems. However, such approaches overlook the importance of integrating localized acoustic cues that convey emotion. To address this, we introduce a novel approach that utilizes a large language model (LLM) to generate holistic emotion tags based on prior dialogue context, while also pinpointing key words in the target utterance that align with the predicted emotional state. Furthermore, we enhance the emotional richness of synthesized speech by incorporating concentrated acoustic features of these key words through a novel selective audio masking loss function. This methodology not only improves emotional expressiveness, but also facilitates automatic emotion speech generation during inference by eliminating the need for manual emotion tag selection. Comprehensive subjective and objective evaluations and analyses demonstrate the effectiveness of the proposed approach.

pdf bib abs
PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
Jihyun Lee | Yejin Jeon | Seungyeon Seo | Gary Lee
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users’ personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains.

2023

pdf bib abs
DORIC : Domain Robust Fine-Tuning for Open Intent Clustering through Dependency Parsing
Jihyun Lee | Seungyeon Seo | Yunsu Kim | Gary Geunbae Lee
Proceedings of the Eleventh Dialog System Technology Challenge

We present our work on Track 2 in the Dialog System Technology Challenges 11 (DSTC11). DSTC11-Track2 aims to provide a benchmark for zero-shot, cross-domain, intent-set induction. In the absence of in-domain training dataset, robust utterance representation that can be used across domains is necessary to induce users’ intentions. To achieve this, we leveraged a multi-domain dialogue dataset to fine-tune the language model and proposed extracting Verb-Object pairs to remove the artifacts of unnecessary information. Furthermore, we devised the method that generates each cluster’s name for the explainability of clustered results. Our approach achieved 3rd place in the precision score and showed superior accuracy and normalized mutual information (NMI) score than the baseline model on various domain datasets.

pdf bib abs
Exploring Back Translation with Typo Noise for Enhanced Inquiry Understanding in Task-Oriented Dialogue
Jihyun Lee | Junseok Kim | Gary Geunbae Lee
Proceedings of the Eleventh Dialog System Technology Challenge

This paper presents our approach to the DSTC11 Track 5 selection task, which focuses on retrieving appropriate natural language knowledge sources for task-oriented dialogue. We propose typologically diverse back-translation method with typo noise, which could generate various structured user inquries. Through our noised back translation, we augmented inquiries by combining three different typologies of language sources with five different typo noise injections. Our experiments demonstrate that typological variety and typo noise aids the model in generalizing to diverse user inquiries in dialogue. In the competition, where 14 teams participated, our approach achieved the 5th rank for exact matching metric.

Jihyun Lee

2025

2023

2016

Co-authors

Venues