Frank Kramer

2024

Creating Ontology-annotated Corpora from Wikipedia for Medical Named-entity Recognition
Johann Frei | Frank Kramer
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Acquiring annotated corpora for medical NLP is challenging due to legal and privacy constraints and costly annotation efforts, and annotated public datasets may not align well with the desired target application in terms of annotation style or language. We investigate the approach of utilizing Wikipedia and WikiData jointly to acquire an unsupervised annotated corpus for named-entity recognition (NER). By controlling the annotation ruleset through WikiData’s ontology, we extract custom-defined annotations and dynamically impute weak annotations via adaptive loss scaling. Our validation on German medication detection datasets yields competitive results. The entire pipeline relies only on open models and data resources, enabling reproducibility and open sharing of models and corpora. All relevant assets are shared on GitHub.
