DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation

Tom Röhr, Thomas Maximilian Josef Steffek, Roman Teucher, Keno Bressem, Alexei Figueroa, Paul Grundmann, Peter Troeger, Felix Alexander Gers, Alexander Löser


Abstract
Large language models (LLMs) show strong reasoning abilities, but fully retraining them for the medical domain is often infeasible due to a lack of data or compute. We present DeepICD-R1, a framework for efficient medical reasoning fine-tuning that unites hierarchical rewards with distilled supervision. We reformulate ICD-10-CM prediction as a reinforcement learning problem and design a hierarchical outcome-based reward that reflects the ICD code structure across chapter, category, and full-code levels. In parallel, we publish a large-scale distilled dataset of over 90k reasoning traces derived from MIMIC-IV admission notes, integrating clinical validation and official coding guidelines. Fine-tuning smaller instruction-tuned LLMs with this data and GRPO reinforcement yields consistent gains in diagnostic accuracy and reasoning coherence. Extensive ablations confirm that hierarchical supervision and verifiable outcome rewards enable competitive, domain-specialized reasoning models without additional pretraining, providing a reproducible foundation for clinical NLP research.
Keywords:
Clinical NLP, Large Reasoning Model, GRPO, Supervised Fine-Tuning
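The hierarchical reward described in the abstract grades a predicted ICD-10-CM code at three granularities: full code, three-character category, and chapter. A minimal sketch of such a reward function is below; the weights (1.0/0.5/0.25) and the letter-based chapter proxy are illustrative assumptions, not values taken from the paper (real ICD-10-CM chapters are defined by code ranges such as E00–E89, not single letters).

```python
def chapter(code: str) -> str:
    # Simplifying assumption: approximate the chapter by the leading letter.
    return code[0].upper()

def category(code: str) -> str:
    # The three-character category, e.g. "E11" from "E11.9".
    return code.replace(".", "")[:3].upper()

def hierarchical_reward(predicted: str, gold: str) -> float:
    """Graded outcome reward: full-code match > category match > chapter match."""
    if predicted.replace(".", "").upper() == gold.replace(".", "").upper():
        return 1.0   # exact full code
    if category(predicted) == category(gold):
        return 0.5   # correct category, wrong extension
    if chapter(predicted) == chapter(gold):
        return 0.25  # correct chapter only
    return 0.0

print(hierarchical_reward("E11.9", "E11.9"))   # 1.0
print(hierarchical_reward("E11.65", "E11.9"))  # 0.5
print(hierarchical_reward("E66.9", "E11.9"))   # 0.25
print(hierarchical_reward("I10", "E11.9"))     # 0.0
```

In a GRPO setup, a scalar reward of this form could score each sampled completion against the gold codes; partial credit at coarser levels gives the policy a denser learning signal than exact-match alone.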
Anthology ID:
2026.lrec-main.843
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
10764–10775
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.843/
Cite (ACL):
Tom Röhr, Thomas Maximilian Josef Steffek, Roman Teucher, Keno Bressem, Alexei Figueroa, Paul Grundmann, Peter Troeger, Felix Alexander Gers, and Alexander Löser. 2026. DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 10764–10775, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation (Röhr et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.843.pdf