NLPSharedTasks: A Corpus of Shared Task Overview Papers in Natural Language Processing Domains

Anna Martin, Ted Pedersen, Jennifer D’Souza


Abstract
As the rate of scientific output continues to grow, it is increasingly important to develop systems that improve the interfaces between researchers and scholarly papers. Training models to extract scientific information from the full texts of scholarly documents is one way to improve how we structure and access that information. However, few annotated corpora provide full paper texts. This paper presents the NLPSharedTasks corpus, a new resource of 254 full-text Shared Task Overview papers in NLP domains with annotated task descriptions. We calculated strict and relaxed inter-annotator agreement scores, achieving Cohen’s kappa coefficients of 0.44 and 0.95, respectively. Lastly, we performed a sentence classification task over the dataset to generate a neural baseline for future research and to provide an example of how to preprocess unbalanced datasets of full scientific texts. We achieved an F1 score of 0.75 using SciBERT, fine-tuned and tested on a rebalanced version of the dataset.
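For readers unfamiliar with the agreement measure cited in the abstract, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators. It is not the authors' code: the function name `cohen_kappa` and the toy label lists are hypothetical, and the example assumes each annotator assigns one binary label per sentence (1 for a task-description sentence, 0 otherwise).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' marginal proportions, summed over all categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data invented for illustration only (not taken from the corpus).
annotator_1 = [1, 0, 0, 1, 0, 0, 0, 1]
annotator_2 = [1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

On this toy data the annotators agree on 6 of 8 items, but kappa is much lower than 0.75 because a large share of that agreement is expected by chance when one label dominates. How the strict and relaxed settings differ in what counts as a match is defined in the paper, not in this sketch.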
Anthology ID:
2022.wiesp-1.13
Volume:
Proceedings of the first Workshop on Information Extraction from Scientific Publications
Month:
November
Year:
2022
Address:
Online
Venue:
WIESP
Publisher:
Association for Computational Linguistics
Pages:
105–120
URL:
https://aclanthology.org/2022.wiesp-1.13
Cite (ACL):
Anna Martin, Ted Pedersen, and Jennifer D’Souza. 2022. NLPSharedTasks: A Corpus of Shared Task Overview Papers in Natural Language Processing Domains. In Proceedings of the first Workshop on Information Extraction from Scientific Publications, pages 105–120, Online. Association for Computational Linguistics.
Cite (Informal):
NLPSharedTasks: A Corpus of Shared Task Overview Papers in Natural Language Processing Domains (Martin et al., WIESP 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.wiesp-1.13.pdf