Abstract
Acquiring annotated corpora for medical NLP is challenging due to legal and privacy constraints and costly annotation efforts, and annotated public datasets may not align well with the desired target application in terms of annotation style or language. We investigate using Wikipedia and WikiData jointly to acquire an annotated corpus for named-entity recognition (NER) in an unsupervised manner. By controlling the annotation ruleset through WikiData's ontology, we extract custom-defined annotations and dynamically impute weak annotations through adaptive loss scaling. Our validation on German medication detection datasets yields competitive results. The entire pipeline relies only on open models and data resources, enabling reproducibility and open sharing of models and corpora. All relevant assets are shared on GitHub.
- Anthology ID:
- 2024.bionlp-1.47
- Volume:
- Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, Kirk Roberts, Junichi Tsujii
- Venues:
- BioNLP | WS
- SIG:
- SIGBIOMED
- Publisher:
- Association for Computational Linguistics
- Pages:
- 570–579
- URL:
- https://aclanthology.org/2024.bionlp-1.47
- DOI:
- 10.18653/v1/2024.bionlp-1.47
- Cite (ACL):
- Johann Frei and Frank Kramer. 2024. Creating Ontology-annotated Corpora from Wikipedia for Medical Named-entity Recognition. In Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 570–579, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Creating Ontology-annotated Corpora from Wikipedia for Medical Named-entity Recognition (Frei & Kramer, BioNLP-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2024.bionlp-1.47.pdf