BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature

Sajawel Ahmed, Manuel Stoeckel, Christine Driller, Adrian Pachzelt, Alexander Mehler


Abstract
The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature hidden in German libraries over the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and butterflies. Our work enables the automatic extraction of biological information previously buried in the mass of papers and volumes. For this purpose, we generated training data for the tasks of Named Entity Recognition (NER) and Taxa Recognition (TR) in biological documents. We use this data to train a number of leading machine learning tools and create a gold standard for TR in biodiversity literature. More specifically, we perform a practical analysis of our newly generated BIOfid dataset through various downstream-task evaluations and establish a new state of the art for TR with 80.23% F-score. In this sense, our paper lays the foundations for future work in the field of information extraction in biology texts.
Anthology ID:
K19-1081
Volume:
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Mohit Bansal, Aline Villavicencio
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
871–880
URL:
https://aclanthology.org/K19-1081
DOI:
10.18653/v1/K19-1081
Cite (ACL):
Sajawel Ahmed, Manuel Stoeckel, Christine Driller, Adrian Pachzelt, and Alexander Mehler. 2019. BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 871–880, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature (Ahmed et al., CoNLL 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/K19-1081.pdf
Supplementary material:
 K19-1081.Supplementary_Material.zip