Creation and evaluation of a dictionary-based tagger for virus species and proteins
Helen Cook, Rūdolfs Bērziņš, Cristina Leal Rodríguez, Juan Miguel Cejuela, Lars Juhl Jensen
Abstract
Text mining automatically extracts information from the literature with the goal of making it available for further analysis, for example by incorporating it into biomedical databases. A key first step towards this goal is to identify and normalize the named entities, such as proteins and species, which are mentioned in text. Despite the large detrimental impact that viruses have on human and agricultural health, very little previous text-mining work has focused on identifying virus species and proteins in the literature. Here, we present an improved dictionary-based system for viral species and the first dictionary for viral proteins, which we benchmark on a new corpus of 300 manually annotated abstracts. We achieve 81.0% precision and 72.7% recall at the task of recognizing and normalizing viral species and 76.2% precision and 34.9% recall on viral proteins. These results are achieved despite the many challenges involved with the names of viral species and, especially, proteins. This work provides a foundation that can be used to extract more complicated relations about viruses from the literature.
- Anthology ID:
- W17-2311
- Volume:
- BioNLP 2017
- Month:
- August
- Year:
- 2017
- Address:
- Vancouver, Canada
- Editors:
- Kevin Bretonnel Cohen, Dina Demner-Fushman, Sophia Ananiadou, Junichi Tsujii
- Venue:
- BioNLP
- SIG:
- SIGBIOMED
- Publisher:
- Association for Computational Linguistics
- Pages:
- 91–98
- URL:
- https://aclanthology.org/W17-2311
- DOI:
- 10.18653/v1/W17-2311
- Cite (ACL):
- Helen Cook, Rūdolfs Bērziņš, Cristina Leal Rodríguez, Juan Miguel Cejuela, and Lars Juhl Jensen. 2017. Creation and evaluation of a dictionary-based tagger for virus species and proteins. In BioNLP 2017, pages 91–98, Vancouver, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Creation and evaluation of a dictionary-based tagger for virus species and proteins (Cook et al., BioNLP 2017)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/W17-2311.pdf