A Meta-dataset of German Medical Corpora: Harmonization of Annotations and Cross-corpus NER Evaluation

Ignacio Llorca, Florian Borchert, Matthieu-P. Schapranow


Abstract
In recent years, an increasing number of publicly available, semantically annotated medical corpora have been released for the German language. While their annotations cover comparable semantic classes, the synergies of such efforts have not yet been explored. This is due to substantial differences in the data schemas (syntax) and annotated entities (semantics), which hinder the creation of common meta-datasets. For instance, it is unclear whether named entity recognition (NER) taggers trained on one or more of these datasets are useful for detecting entities in any of the other datasets. In this work, we create harmonized versions of German medical corpora using the BigBIO framework and make them available to the community. Using these as a meta-dataset, we perform a series of cross-corpus evaluation experiments on two settings of aligned labels. These consist of fine-tuning various pre-trained Transformers on different combinations of training sets and testing them against each dataset separately. We find that a) trained NER models generalize poorly, with F1 scores dropping by approximately 20 percentage points (pp) on unseen test data, and b) current pre-trained Transformer models for the German language do not systematically alleviate this issue. However, our results suggest that models benefit from additional training corpora in most cases, even if these belong to different medical fields or text genres.
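The cross-corpus setup described in the abstract (training on combinations of corpora, testing on each corpus separately) can be pictured as a grid of experiments. The following is only a minimal sketch of how such a grid might be enumerated, not the authors' code: the corpus names are placeholders because the abstract does not list the corpora, and `evaluation_grid` and `gap_in_pp` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations

# Placeholder names (assumption): the abstract does not name the corpora.
CORPORA = ["corpus_a", "corpus_b", "corpus_c"]


def evaluation_grid(corpora):
    """Yield (train_combo, test_corpus, seen) for every non-empty
    combination of training corpora and every test corpus.

    `seen` is True when the test corpus is part of the training
    combination, and False for a genuinely cross-corpus evaluation.
    """
    for size in range(1, len(corpora) + 1):
        for train_combo in combinations(corpora, size):
            for test_corpus in corpora:
                yield train_combo, test_corpus, test_corpus in train_combo


def gap_in_pp(f1_seen, f1_unseen):
    """Difference between two F1 scores (on a 0..1 scale),
    expressed in percentage points."""
    return round((f1_seen - f1_unseen) * 100, 1)


grid = list(evaluation_grid(CORPORA))
cross_corpus = [(train, test) for train, test, seen in grid if not seen]
```

Under these assumptions, each entry of `grid` corresponds to one fine-tuned model evaluated on one test set, and `cross_corpus` keeps only the runs where the test corpus was unseen during training, which is the case the reported F1 drop refers to.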
Anthology ID:
2023.clinicalnlp-1.23
Volume:
Proceedings of the 5th Clinical Natural Language Processing Workshop
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, Anna Rumshisky
Venue:
ClinicalNLP
Publisher:
Association for Computational Linguistics
Pages:
171–181
URL:
https://aclanthology.org/2023.clinicalnlp-1.23
DOI:
10.18653/v1/2023.clinicalnlp-1.23
Cite (ACL):
Ignacio Llorca, Florian Borchert, and Matthieu-P. Schapranow. 2023. A Meta-dataset of German Medical Corpora: Harmonization of Annotations and Cross-corpus NER Evaluation. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 171–181, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
A Meta-dataset of German Medical Corpora: Harmonization of Annotations and Cross-corpus NER Evaluation (Llorca et al., ClinicalNLP 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.clinicalnlp-1.23.pdf