Gibaeg Kim


2026

We introduce EPAG, a benchmark dataset and framework for evaluating the pre-consultation ability of LLMs using diagnostic guidelines. LLMs are evaluated directly, through comparison of the HPI (History of Present Illness) with diagnostic guidelines, and indirectly, through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned on a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that a larger amount of HPI does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
Clinical dialogue-to-note generation is challenging because clinically salient evidence is noisy, distributed across turns, and often revised later in the encounter. Direct transcript-only prompting and coarse intermediate scaffolds can therefore suffer from omissions, section leakage, unsupported fill-in, and brittle final-state tracking. We propose Clinical Atomic Propositions (CAPs), a dialogue-aware intermediate representation for faithful clinical note generation. CAPs extract source-grounded clinical assertions while preserving modifiers such as verification status, temporality, speaker/source, and action type. We also study an optional event consolidation layer that groups CAPs into problem-oriented care bundles before note rendering. We evaluate five methods on a 197-case ACI-Bench cohort: a transcript-only baseline, prompt-based reimplementations of Cluster2Sent and MEDSUM-ENT, CAP, and CAP+Event. The main task uses a sectioned-note template, with SOAP-template rendering and transcript-free rendering reported as ablations. We use MEDSUM-ENT-style GPT-R/P/F1 metrics and a proposition-grounded semCAP-R/P/F1 audit to measure concept-level and source-grounded faithfulness, complemented by case-level win/tie/loss analysis and clinician deep review. Results show that CAP improves preservation of transcript-grounded clinical propositions while remaining competitive on concept-level GPT metrics. CAP+Event is not uniformly better than CAP, but qualitative and boundary analyses show when problem-oriented consolidation can improve organization and when compression can introduce omissions. We release code, prompts, intermediate representations, generated notes, and evaluation artifacts at a public repository.

2025

Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods, such as guardrails and tool-calling, often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (Taxonomy of Comprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS covers a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our taxonomy, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal valuable insights about the training data distribution and the pretrained knowledge of base models.
Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term **Format Inertia**, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
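As a rough illustration of what rebalancing a turn-count distribution can look like, the sketch below caps how many dialogues any single turn count may contribute. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `rebalance_by_turn_count`, the `cap_per_count` parameter, and the choice of downsampling (rather than upsampling or reweighting) are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def rebalance_by_turn_count(dialogues, cap_per_count, seed=0):
    """Downsample so no turn count contributes more than cap_per_count dialogues.

    Hypothetical helper: each dialogue is assumed to be a list of turns,
    and its turn count is simply len(dialogue).
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible

    # Group dialogues by their number of turns.
    buckets = defaultdict(list)
    for dialogue in dialogues:
        buckets[len(dialogue)].append(dialogue)

    # Keep at most cap_per_count dialogues from each group.
    balanced = []
    for turn_count in sorted(buckets):
        group = buckets[turn_count]
        if len(group) > cap_per_count:
            group = rng.sample(group, cap_per_count)
        balanced.extend(group)
    return balanced


# Toy data (invented): many short dialogues, few long ones.
toy = [["q", "a"]] * 8 + [["q", "a", "q", "a"]] * 3 + [["q", "a"] * 5] * 1
balanced = rebalance_by_turn_count(toy, cap_per_count=3)
turn_counts = Counter(len(d) for d in balanced)
```

In the toy data, the over-represented two-turn dialogues are cut from eight to three, while the rarer four-turn and ten-turn dialogues are kept in full.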

2024

Hierarchical text classification (HTC) is a challenging problem with two key issues: utilizing structural information and mitigating label imbalance. Recently, unit-based approaches, which generate unit-based feature representations, have outperformed global approaches, which rely on a single global feature representation. Nevertheless, unit-based models trained with BCE and ZLPR losses still suffer from static thresholding and label imbalance. These challenges become more critical in large-scale hierarchies. This paper introduces a novel hierarchy-aware loss function for unit-based HTC models: Hierarchy-aware Biased Bound Margin (HBM) loss. HBM integrates learnable bounds, biases, and a margin to address static thresholding and to mitigate label imbalance adaptively. Experimental results on benchmark datasets demonstrate the superior performance of HBM compared to competitive HTC models.
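To make the static-thresholding issue concrete, the sketch below implements the ZLPR baseline loss as it is commonly formulated, log(1 + Σ_neg e^{s}) + log(1 + Σ_pos e^{-s}), together with its fixed decision rule of predicting a label when its score exceeds 0. This is not HBM, whose formula is not given in the abstract; the function names `zlpr_loss` and `predict` and the example scores are assumptions made for illustration.

```python
import math


def zlpr_loss(scores, positives):
    """ZLPR loss for one example (baseline loss, not HBM).

    scores: list of raw label scores; positives: set of positive label indices.
    """
    # Negative labels are pushed below 0, positive labels above 0.
    neg_sum = sum(math.exp(s) for i, s in enumerate(scores) if i not in positives)
    pos_sum = sum(math.exp(-s) for i, s in enumerate(scores) if i in positives)
    return math.log1p(neg_sum) + math.log1p(pos_sum)


def predict(scores, threshold=0.0):
    """Static thresholding: the same fixed cut-off is applied to every label."""
    return {i for i, s in enumerate(scores) if s > threshold}


# Invented scores for a three-label example where labels 0 and 2 are positive.
example_scores = [2.0, -1.0, 0.5]
example_loss = zlpr_loss(example_scores, positives={0, 2})
example_prediction = predict(example_scores)
```

Because the cut-off in `predict` is the same constant for every label, rare labels whose scores tend to sit lower are handled no differently from frequent ones, which is the limitation the learnable bounds and biases in HBM are described as addressing.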