ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus

Erik Faessler, Luise Modersohn, Christina Lohr, Udo Hahn


Abstract
Genes and proteins constitute the fundamental entities of molecular genetics. Here we introduce ProGene (formerly called FSU-PRGE), a corpus that reflects our efforts to cope with this important class of named entities within the framework of a long-running, large-scale annotation campaign at the Jena University Language & Information Engineering (JULIE) Lab. We assembled the corpus from 11 subcorpora covering various biological domains in order to make it independent of any single subdomain. It consists of 3,308 MEDLINE abstracts with over 36k sentences and more than 960k tokens, annotated with nearly 60k named entity mentions. Two annotators strove to carefully assign entity mentions to the classes of genes/proteins, families/groups, complexes, variants, and enumerations of those, with genes and proteins represented by a single class. The main purpose of the corpus is to provide a large body of consistent and reliable annotations for supervised training and evaluation of machine learning algorithms in this domain. Furthermore, we provide an evaluation of two state-of-the-art baseline systems, BioBERT and flair, on the ProGene corpus. We make the evaluation datasets and the trained models available to encourage comparable evaluations of new methods in the future.
Anthology ID:
2020.lrec-1.564
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4585–4596
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.564
Cite (ACL):
Erik Faessler, Luise Modersohn, Christina Lohr, and Udo Hahn. 2020. ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4585–4596, Marseille, France. European Language Resources Association.
Cite (Informal):
ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus (Faessler et al., LREC 2020)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2020.lrec-1.564.pdf