Abstract
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
- Anthology ID:
- D19-1371
- Volume:
- Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong, China
- Editors:
- Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
- Venues:
- EMNLP | IJCNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3615–3620
- URL:
- https://preview.aclanthology.org/add_missing_videos/D19-1371/
- DOI:
- 10.18653/v1/D19-1371
- Cite (ACL):
- Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal):
- SciBERT: A Pretrained Language Model for Scientific Text (Beltagy et al., EMNLP-IJCNLP 2019)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/D19-1371.pdf
- Code:
- allenai/scibert
- Data:
- ACL ARC, BC5CDR, ChemProt, EBM-NLP, GENIA, JNLPBA, Microsoft Academic Graph, NCBI Disease, Paper Field, PubMed, SciCite, SciERC