Tri-Train: Automatic Pre-Fine Tuning between Pre-Training and Fine-Tuning for SciNER

Qingkai Zeng, Wenhao Yu, Mengxia Yu, Tianwen Jiang, Tim Weninger, Meng Jiang


Abstract
The training process of scientific NER models is commonly performed in two steps: i) pre-training a language model with self-supervised tasks on huge data and ii) fine-tuning with small labelled data. The success of this strategy depends on the relevance between the data domains and between the tasks. However, gaps are found in practice when the target domains are specific and small. We propose a novel framework that introduces a "pre-fine tuning" step between pre-training and fine-tuning. It constructs a corpus by selecting, from unlabeled documents, the sentences that are most relevant to the labelled training data. Instead of predicting tokens in random spans, the pre-fine tuning task is to predict tokens in entity candidates identified by text mining methods. Pre-fine tuning is automatic and lightweight because the corpus can be much smaller than the pre-training data while achieving better performance. Experiments on seven benchmarks demonstrate the effectiveness.
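The abstract's masking objective can be illustrated with a minimal sketch: rather than masking random spans, mask the tokens inside entity candidates produced by a text-mining step. The helper name, the example sentence, and the candidate spans below are hypothetical, not taken from the paper.

```python
# Sketch of entity-candidate masking for pre-fine tuning (assumed interface).
# Candidate spans would come from a text-mining method; here they are hand-supplied.

def mask_entity_candidates(tokens, candidate_spans, mask_token="[MASK]"):
    """Replace tokens inside each half-open (start, end) span with mask_token."""
    masked = list(tokens)
    for start, end in candidate_spans:
        for i in range(start, end):
            masked[i] = mask_token
    return masked

sentence = ["We", "train", "a", "convolutional", "neural", "network", "on", "ImageNet", "."]
candidates = [(3, 6), (7, 8)]  # hypothetical entity candidates
print(mask_entity_candidates(sentence, candidates))
# -> ['We', 'train', 'a', '[MASK]', '[MASK]', '[MASK]', 'on', '[MASK]', '.']
```

The model is then trained to recover the masked entity-candidate tokens, focusing the objective on domain-specific terminology rather than arbitrary spans.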
Anthology ID:
2020.findings-emnlp.429
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4778–4787
URL:
https://aclanthology.org/2020.findings-emnlp.429
DOI:
10.18653/v1/2020.findings-emnlp.429
Cite (ACL):
Qingkai Zeng, Wenhao Yu, Mengxia Yu, Tianwen Jiang, Tim Weninger, and Meng Jiang. 2020. Tri-Train: Automatic Pre-Fine Tuning between Pre-Training and Fine-Tuning for SciNER. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4778–4787, Online. Association for Computational Linguistics.
Cite (Informal):
Tri-Train: Automatic Pre-Fine Tuning between Pre-Training and Fine-Tuning for SciNER (Zeng et al., Findings 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.429.pdf
Optional supplementary material:
2020.findings-emnlp.429.OptionalSupplementaryMaterial.zip
Data
SciERC