Yohan Lee
Unverified author pages with similar names: Yohan Lee
2025
Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval
Yohan Lee | Yongwoo Song | Sangyeop Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Yohan Lee | Yongwoo Song | Sangyeop Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach only around NDCG@10 of 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://github.com/l-yohai/CDR-Benchmark.
Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue
Sangyeop Kim | Yohan Lee | Sanghwa Kim | Hyunjong Kim | Sungzoon Cho
Findings of the Association for Computational Linguistics: EMNLP 2025
Sangyeop Kim | Yohan Lee | Sanghwa Kim | Hyunjong Kim | Sungzoon Cho
Findings of the Association for Computational Linguistics: EMNLP 2025
Effective long-term memory in conversational AI requires synthesizing information across multiple sessions. However, current systems place excessive reasoning burden on response generation, making performance significantly dependent on model sizes. We introduce PREMem (Pre-storage Reasoning for Episodic Memory), a novel approach that shifts complex reasoning processes from inference to memory construction. PREMem extracts fine-grained memory fragments categorized into factual, experiential, and subjective information; it then establishes explicit relationships between memory items across sessions, capturing evolution patterns like extensions, transformations, and implications. By performing this reasoning during pre-storage rather than when generating a response, PREMem creates enriched representations while reducing computational demands during interactions. Experiments show significant performance improvements across all model sizes, with smaller models achieving results comparable to much larger baselines while maintaining effectiveness even with constrained token budgets. Code and dataset are available at https://github.com/sangyeop-kim/PREMem.
HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs
Sangyeop Kim | Hangyeul Lee | Yohan Lee
Findings of the Association for Computational Linguistics: NAACL 2025
Sangyeop Kim | Hangyeul Lee | Yohan Lee
Findings of the Association for Computational Linguistics: NAACL 2025
The growth of conversational AI services has increased demand for effective information retrieval from dialogue data. However, existing methods often face challenges in capturing semantic intent or require extensive labeling and fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted Semantic Indexing for Retrieval), a novel framework that enhances semantic understanding in conversational data retrieval through optimized data ingestion, eliminating the need for resource-intensive labeling or model adaptation.HEISIR implements a two-step process: (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, creating semantic indices consisting of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured representation effectively captures the underlying semantic information from dialogue content. HEISIR achieves high retrieval performance while maintaining low latency during the actual retrieval process. Our experimental results demonstrate that HEISIR outperforms fine-tuned models across various embedding types and language models. Beyond improving retrieval capabilities, HEISIR also offers opportunities for intent and topic analysis in conversational data, providing a versatile solution for dialogue systems.
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
Sangyeop Kim | Yohan Lee | Yongwoo Song | Kimin Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Sangyeop Kim | Yohan Lee | Yongwoo Song | Kimin Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments utilize context length of up to 128K tokens. Through comprehensive analysis with various many-shot attack settings with different instruction styles, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.