Abstract
Named entity recognition identifies common classes of entities in text, but these entity labels are generally sparse, limiting utility to downstream tasks. In this work we present ScienceExamCER, a densely-labeled semantic classification corpus of 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms. Semantic class labels are drawn from a manually-constructed fine-grained typology of 601 classes generated through a data-driven analysis of 4,239 science exam questions. We show an off-the-shelf BERT-based named entity recognition model modified for multi-label classification achieves an accuracy of 0.85 F1 on this task, suggesting strong utility for downstream tasks in science domain question answering requiring densely-labeled semantic classification.

- Anthology ID: 2020.lrec-1.558
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 4529–4546
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.558
- Cite (ACL): Hannah Smith, Zeyu Zhang, John Culnan, and Peter Jansen. 2020. ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4529–4546, Marseille, France. European Language Resources Association.
- Cite (Informal): ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition (Smith et al., LREC 2020)
- PDF: https://preview.aclanthology.org/remove-xml-comments/2020.lrec-1.558.pdf
- Data: ScienceExamCER, ARC, CoNLL-2003