Keno Bressem

2026

Large language models (LLMs) show strong reasoning abilities, but full retraining for the medical domain is often infeasible because data or compute resources are limited. We present DeepICD-R1, a framework for efficient medical reasoning fine-tuning that combines hierarchical rewards with distilled supervision. We reformulate ICD-10-CM prediction as a reinforcement learning problem and design a hierarchical outcome-based reward that reflects the ICD code structure across chapter, category, and full-code levels. In parallel, we publish a large-scale distilled dataset of over 90k reasoning traces derived from MIMIC-IV admission notes, integrating clinical validation and official coding guidelines. Fine-tuning smaller instruction-tuned LLMs with this data and GRPO reinforcement yields consistent gains in diagnostic accuracy and reasoning coherence. Extensive ablations confirm that hierarchical supervision and verifiable outcome rewards enable competitive, domain-specialized reasoning models without additional pretraining, providing a reproducible foundation for clinical NLP research.

Keywords: Clinical NLP, Large Reasoning Model, GRPO, Supervised Fine-Tuning
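The abstract does not give the reward formula, so the following is only a minimal sketch of what a hierarchical outcome reward over chapter, category, and full-code levels could look like. The function names, the weights (0.2, 0.3, 0.5), and the use of the leading letter as a stand-in for the ICD-10-CM chapter are all assumptions made for illustration. Real chapters are defined by code ranges that do not always align with a single letter.

```python
def normalize(code: str) -> str:
    """Upper-case an ICD-10-CM code and drop the dot, e.g. 'i50.9' -> 'I509'."""
    return code.strip().upper().replace(".", "")


def hierarchical_reward(
    predicted: str,
    gold: str,
    w_chapter: float = 0.2,   # hypothetical weight, not taken from the paper
    w_category: float = 0.3,  # hypothetical weight, not taken from the paper
    w_full: float = 0.5,      # hypothetical weight, not taken from the paper
) -> float:
    """Give partial credit for each level of the code hierarchy that matches.

    Assumption: the chapter is approximated by the leading letter, and the
    category by the first three characters. A faithful version would map
    codes to chapters through the official chapter ranges.
    """
    p, g = normalize(predicted), normalize(gold)
    if not p or not g:
        return 0.0

    reward = 0.0
    if p[0] == g[0]:              # coarse level: same (approximate) chapter
        reward += w_chapter
        if p[:3] == g[:3]:        # middle level: same three-character category
            reward += w_category
            if p == g:            # finest level: exact full-code match
                reward += w_full
    return reward


if __name__ == "__main__":
    for pred in ["I50.9", "I50.1", "I10", "J18.9"]:
        print(pred, hierarchical_reward(pred, "I50.9"))
```

Nesting the checks means a prediction only earns credit at a finer level if it is already correct at the coarser one, so the reward grows monotonically as the prediction gets closer to the gold code.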

2024

Clinical Decision Support Systems assist medical professionals in providing optimal care for patients. A prominent data source used for creating tasks for such systems is the Medical Information Mart for Intensive Care (MIMIC). MIMIC contains electronic health records (EHR) gathered in a tertiary hospital in the United States. The majority of past work is based on the third version of MIMIC, although the fourth is the most recent version. This new version not only introduces more data into MIMIC, but also increases the variety of patients. While MIMIC-III is limited to intensive care units, MIMIC-IV also offers EHRs from the emergency department. In this work, we investigate how to adapt previous work to update clinical outcome prediction for MIMIC-IV. We revisit several established tasks, including prediction of diagnoses, procedures, and length-of-stay, and also introduce a novel task: patient routing prediction. Furthermore, we quantitatively and qualitatively evaluate all tasks on several biomedical transformer encoder models. Finally, we provide narratives for future research directions in the clinical outcome prediction domain. We make our source code publicly available to reproduce our experiments, data, and tasks.

Co-authors

Paul Grundmann 1

Wolfgang Nejdl 1

Jens-Michalis Papaioannou 1

Thomas Maximilian Josef Steffek 1

Roman Teucher 1

Peter Troeger 1

Venues
