Siun Kim

2026

DiZiNER: Disagreement-guided Instruction Refinement via Simulating Pilot Annotation for Zero-shot Named Entity Recognition
Siun Kim | Hyung-Jin Yoon
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and narrowing the zero-shot-to-supervised gap by over 11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.

2025

Questioning Our Questions: How Well Do Medical QA Benchmarks Evaluate Clinical Capabilities of Language Models?
Siun Kim | Hyung-Jin Yoon
Proceedings of the 24th Workshop on Biomedical Language Processing

Recent advances in large language models (LLMs) have led to impressive performance on medical question-answering (QA) benchmarks. However, the extent to which these benchmarks reflect real-world clinical capabilities remains uncertain. To address this gap, we systematically analyzed the correlation between LLM performance on major medical QA benchmarks (e.g., MedQA, MedMCQA, PubMedQA, and MMLU medicine subjects) and clinical performance in real-world settings. Our dataset included 702 clinical evaluations of 85 LLMs from 168 studies. Benchmark scores demonstrated a moderate correlation with clinical performance (Spearman’s rho = 0.59), albeit substantially lower than inter-benchmark correlations. Among them, MedQA was the most predictive but failed to capture essential competencies such as patient communication, longitudinal care, and clinical information extraction. Using Bayesian hierarchical modeling, we estimated representative clinical performance and identified GPT-4 and GPT-4o as consistently top-performing models, often matching or exceeding human physicians. Despite longstanding concerns about the clinical validity of medical QA benchmarks, this study offers the first quantitative analysis of their alignment with real-world clinical performance.

2024

Eligibility criteria (EC) refer to a set of conditions an individual must meet to participate in a clinical trial, defining the study population and minimizing potential risks to patients. Previous research in clinical trial design has been primarily focused on searching for similar trials and generating EC within manual instructions, employing similarity-based performance metrics, which may not fully reflect human judgment. In this study, we propose a novel task of recommending EC based on clinical trial information, including trial titles, and introduce an automatic evaluation framework to assess the clinical validity of the EC recommendation model. Our new approach, known as CReSE (Contrastive learning and Rephrasing-based and Clinical Relevance-preserving Sentence Embedding), represents EC through contrastive learning and rephrasing via large language models (LLMs). The CReSE model outperforms existing language models pre-trained on the biomedical domain in EC clustering. Additionally, we have curated a benchmark dataset comprising 3.2M high-quality EC-title pairs extracted from 270K clinical trials available on ClinicalTrials.gov. The EC recommendation models achieve commendable performance metrics, with 49.0% precision@1 and 44.2% MAP@5 on our evaluation framework. We expect that our evaluation framework built on the CReSE model will contribute significantly to the development and assessment of the EC recommendation models in terms of clinical validity.

Co-authors

Jung-Hyun Won 1

Lijun Wu 1
