On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model

Seongjin Shin; Sang-Woo Lee; Hwijeen Ahn; Sungdong Kim; HyoungSeok Kim; Boseop Kim; Kyunghyun Cho; Gichang Lee; Woomyoung Park; Jung-Woo Ha; Nako Sung

doi:10.18653/v1/2022.naacl-main.380

Abstract

Many recent studies on large-scale language models have reported successful in-context zero- and few-shot learning ability. However, an in-depth analysis of when in-context learning occurs is still lacking. For example, it is unknown how in-context learning performance changes as the training corpus varies. Here, we investigate the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From our in-depth investigation, we report the following observations: (1) in-context learning performance depends heavily on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning; (2) in-context learning ability can emerge when a language model is trained on a combination of multiple corpora, even when no individual corpus results in in-context learning on its own; (3) pretraining with a corpus related to a downstream task does not always guarantee competitive in-context learning performance on that task, especially in the few-shot setting; and (4) language modeling quality (measured in perplexity) and in-context learning performance do not always correlate: e.g., low perplexity does not always imply high in-context few-shot learning performance.

Anthology ID: 2022.naacl-main.380
Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 5168–5186
URL: https://aclanthology.org/2022.naacl-main.380
DOI: 10.18653/v1/2022.naacl-main.380
Cite (ACL): Seongjin Shin, Sang-Woo Lee, Hwijeen Ahn, Sungdong Kim, HyoungSeok Kim, Boseop Kim, Kyunghyun Cho, Gichang Lee, Woomyoung Park, Jung-Woo Ha, and Nako Sung. 2022. On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5168–5186, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model (Shin et al., NAACL 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-5/2022.naacl-main.380.pdf
Video: https://preview.aclanthology.org/nschneid-patch-5/2022.naacl-main.380.mp4
Data: KorQuAD
