Thomas Steffek

2026

CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
Paul Grundmann | Jan Frick | Dennis Fast | Thomas Steffek | Felix Gers | Wolfgang Nejdl | Alexander Löser
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.

2024


Data Drift in Clinical Outcome Prediction from Admission Notes
Paul Grundmann | Jens-Michalis Papaioannou | Tom Oberhauser | Thomas Steffek | Amy Siu | Wolfgang Nejdl | Alexander Loeser
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Clinical NLP research faces a scarcity of publicly available datasets due to privacy concerns. MIMIC-III marked a significant milestone, enabling substantial progress, and now, with MIMIC-IV, the dataset has expanded significantly, offering a broader scope. In this paper, we focus on the task of predicting clinical outcomes from clinical text. This is crucial in modern healthcare, aiding in preventive care, differential diagnosis, and capacity planning. We introduce a novel clinical outcome prediction dataset derived from MIMIC-IV. Furthermore, we provide initial insights into the performance of models trained on MIMIC-III when applied to our new dataset, with specific attention to potential data drift. We investigate challenges tied to evolving documentation standards and changing codes in the International Classification of Diseases (ICD) taxonomy, such as the transition from ICD-9 to ICD-10. We also explore variations in clinical text across different hospital wards. Our study aims to probe the robustness and generalization of clinical outcome prediction models, contributing to the ongoing advancement of clinical NLP in healthcare.

Co-authors

Alexander Loeser 1

Alexander Löser 1

Tom Oberhauser 1

Jens-Michalis Papaioannou 1

Amy Siu 1

