Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cecile Paris


Abstract
Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
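As a rough illustration of how a similarity measure might nominate pretraining data, the sketch below ranks candidate source corpora by how much of a target task's vocabulary they cover. Vocabulary overlap is only one possible measure and is chosen here as an assumption for illustration; it is not confirmed by the abstract as the authors' method. The function names (`top_vocab`, `vocab_overlap`, `rank_sources`), the whitespace tokenization, and the cutoff `k` are all hypothetical.

```python
from collections import Counter


def top_vocab(texts, k=10000):
    """Return the k most frequent lowercased whitespace tokens in texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def vocab_overlap(source_texts, target_texts, k=10000):
    """Fraction of the target's top-k vocabulary also found in the source's top-k."""
    source_vocab = top_vocab(source_texts, k)
    target_vocab = top_vocab(target_texts, k)
    if not target_vocab:
        return 0.0
    return len(source_vocab & target_vocab) / len(target_vocab)


def rank_sources(candidates, target_texts, k=10000):
    """Sort candidate corpora (name -> list of texts) by overlap, highest first."""
    scores = {
        name: vocab_overlap(texts, target_texts, k)
        for name, texts in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented purely for illustration.
candidates = {
    "tweets": ["lol my head hurts so bad", "cant sleep again lol"],
    "papers": ["the protein binds the receptor", "we measure receptor expression"],
}
target = ["my head hurts lol", "cant sleep"]
ranking = rank_sources(candidates, target)
```

In this toy setup the corpus sharing more of the target's vocabulary is ranked first, which is the sense in which a cheap similarity score could stand in for a full pretraining run when choosing among candidate corpora.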
Anthology ID: 2020.findings-emnlp.151
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1675–1681
URL: https://aclanthology.org/2020.findings-emnlp.151
DOI: 10.18653/v1/2020.findings-emnlp.151
Cite (ACL): Xiang Dai, Sarvnaz Karimi, Ben Hachey, and Cecile Paris. 2020. Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1675–1681, Online. Association for Computational Linguistics.
Cite (Informal): Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media (Dai et al., Findings 2020)
PDF: https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.151.pdf
Data: 2010 i2b2/VA, GLUE, SST