Developing the German Medical Text Corpus (GeMTeX): Legal Compliance and Semantic Enrichment

Justin Hofenbitzer, Christina Lohr, Andrea Riedel, Rebekka Kiser, Aliaksandra Shutsko, Abanoub Abdelmalak, Peter Klügl, Jutta Romberg, Sarah Riepenhausen, Miriam Schechner, Jakob Faller, Frank Meineke, Luise Modersohn, Markus Löffler, Juliane Fluck, Udo Hahn, Stefan Schulz, Martin Boeker


Abstract
GeMTeX is a large-scale German Medical Text Corpus project that aims to publish a national clinical reference corpus. The resource is under construction and, as of February 2026, comprises more than 15k clinical documents (20M tokens) from six German university hospitals. In building GeMTeX, care was taken to comply with European regulatory requirements. In phase I, patients were asked to permit the reuse of their clinical documents on the legal basis of informed consent. In phase II, consented documents from six major clinical sites in Germany underwent a thorough de-identification process. In phase III, which is ongoing, we enrich this unlocked dataset with semantic information from the clinical domain. The annotation process is guided by SNOMED CT, which allows expressions in clinical documents to be grounded directly in a worldwide shared medical documentation and ontology standard. The corpus is accessible upon request under controlled access conditions. Interested researchers can visit https://kiinformatik.mri.tum.de/en/gemtex or reach out via gemtex.mi@mh.tum.de.
Anthology ID:
2026.lrec-main.122
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
1571–1584
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.122/
Cite (ACL):
Justin Hofenbitzer, Christina Lohr, Andrea Riedel, Rebekka Kiser, Aliaksandra Shutsko, Abanoub Abdelmalak, Peter Klügl, Jutta Romberg, Sarah Riepenhausen, Miriam Schechner, Jakob Faller, Frank Meineke, Luise Modersohn, Markus Löffler, Juliane Fluck, Udo Hahn, Stefan Schulz, and Martin Boeker. 2026. Developing the German Medical Text Corpus (GeMTeX): Legal Compliance and Semantic Enrichment. International Conference on Language Resources and Evaluation, main:1571–1584.
Cite (Informal):
Developing the German Medical Text Corpus (GeMTeX): Legal Compliance and Semantic Enrichment (Hofenbitzer et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.122.pdf