Self-training Improves Pre-training for Natural Language Understanding

Jingfei Du; Édouard Grave; Beliz Gunel; Vishrav Chaudhary; Onur Çelebi; Michael Auli; Veselin Stoyanov; Alexis Conneau

doi:10.18653/v1/2021.naacl-main.426

Abstract

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.
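
The retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence encoder, averaging the labeled embeddings is only one assumed way to build a task-level query, and the helper names and example sentences are hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: L2-normalised bag-of-words counts."""
    counts = Counter(sentence.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both have unit length)."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def task_query(labeled_sentences):
    """Assumed query construction: mean of labeled embeddings, re-normalised."""
    total = Counter()
    for sentence in labeled_sentences:
        for w, v in embed(sentence).items():
            total[w] += v
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {w: v / norm for w, v in total.items()} if norm else {}


def retrieve(query, bank, k):
    """Return the k bank sentences most similar to the query embedding."""
    ranked = sorted(bank, key=lambda s: cosine(query, embed(s)), reverse=True)
    return ranked[:k]


# Hypothetical labeled task data and a tiny unlabeled sentence bank.
labeled = ["a wonderful and moving film", "the movie was dull and boring"]
bank = [
    "the stock market fell sharply today",
    "this film was wonderful",
    "rain is expected tomorrow",
    "a boring movie with a dull plot",
]

retrieved = retrieve(task_query(labeled), bank, k=2)
print(retrieved)
```

In a self-training setup, sentences retrieved this way would then be labeled by a model trained on the original task data and used as additional training examples; that step is omitted here.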

Anthology ID: 2021.naacl-main.426
Volume: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: June
Year: 2021
Address: Online
Editors: Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 5408–5418
URL: https://aclanthology.org/2021.naacl-main.426
DOI: 10.18653/v1/2021.naacl-main.426
Cite (ACL): Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Veselin Stoyanov, and Alexis Conneau. 2021. Self-training Improves Pre-training for Natural Language Understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5408–5418, Online. Association for Computational Linguistics.
Cite (Informal): Self-training Improves Pre-training for Natural Language Understanding (Du et al., NAACL 2021)
PDF: https://preview.aclanthology.org/nschneid-patch-3/2021.naacl-main.426.pdf
Code: facebookresearch/SentAugment
Data: SST, SST-2, SST-5
