Markus Kreuzthaler


Localising the Clinical Terminology SNOMED CT by Semi-automated Creation of a German Interface Vocabulary
Stefan Schulz | Larissa Hammer | David Hashemian-Nik | Markus Kreuzthaler
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

Medical language exhibits great variations regarding users, institutions and language registers. With large parts of clinical documents in free text, NLP is playing a more and more important role in unlocking re-usable and interoperable meaning from medical records. This study describes the architectural principles and the evolution of a German interface vocabulary, combining machine translation with human annotation and rule-based term generation, yielding a resource with 7.7 million raw entries, each of which linked to the reference terminology SNOMED CT, an international standard with about 350 thousand concepts. The purpose is to offer a high coverage of medical jargon in order to optimise terminology grounding of clinical texts by text mining systems. The core resource is a manually curated table of English-to-German word and chunk translations, supported by a set of language generation rules. The work describes a workflow consisting the enrichment and modification of this table with human and machine efforts, manually enriched by grammarspecific tags. Top-down and bottom-up methods for terminology population used in parallel. The final interface terms are produced by a term generator, which creates one-to-many German variants per SNOMED CT English description. Filtering against a large collection of domain terminologies and corpora drastically reduces the size of the vocabulary in favour of more realistic terms or terms that can reasonably be expected to match clinical text passages within a text-mining pipeline. An evaluation was performed by a comparison between the current version of the German interface vocabulary and the English description table of the SNOMED CT International release. An exact term matching was performed with a small parallel corpus constituted by text snippets from different clinical documents. With overall low retrieval parameters (with F-values around 30%), the performance of the German language scenario reaches 80 – 90% of the English one. Interestingly, annotations are slightly better with machine-translated (German – English) texts, using the International SNOMED CT resource only.


Unsupervised Abbreviation Detection in Clinical Narratives
Markus Kreuzthaler | Michel Oleynik | Alexander Avian | Stefan Schulz
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)

Clinical narratives in electronic health record systems are a rich resource of patient-based information. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambiguation of period characters is therefore an important task for sentence and abbreviation detection. This task is addressed by a combination of co-occurrence information of word types with trailing period characters, a large domain dictionary, and a simple rule engine, thus merging statistical and dictionary-based disambiguation strategies. An F-measure of 0.95 could be reached by using the unsupervised approach presented in this paper. The results are promising for a domain-independent abbreviation detection strategy, because our approach avoids retraining of models or use case specific feature engineering efforts required for supervised machine learning approaches.


Disambiguation of Period Characters in Clinical Narratives
Markus Kreuzthaler | Stefan Schulz
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)