Seungseop Lim
2026
Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines
Jean Seo | Gibaeg Kim | Kihun Shin | Seungseop Lim | Hyunkyung Lee | Wooseok Han | Jongwon Lee | Eunho Yang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
We introduce EPAG, a benchmark dataset and framework designed for evaluating the pre-consultation ability of LLMs using diagnostic guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
2025
Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation
Seungseop Lim | Gibaeg Kim | Wooseok Han | Jean Seo | Hyunkyung Lee | Jaehyo Yoo | Eunho Yang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term Format Inertia, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
Taxonomy of Comprehensive Safety for Clinical Agents
Jean Seo | Hyunkyung Lee | Gibaeg Kim | Wooseok Han | Jaehyo Yoo | Seungseop Lim | Kihun Shin | Eunho Yang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods, such as guardrails and tool-calling, often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (Taxonomy of Comprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS covers a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our taxonomy, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal valuable insights about training data distribution and the pretrained knowledge of base models.