Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context

Huibin Ge, Chenxi Sun, Deyi Xiong, Qun Liu


Abstract
This paper presents a Chinese dataset for evaluating pretrained language models on Word Prediction given Long-term Context (Chinese WPLC). We propose both automatic and manual selection strategies tailored to Chinese to guarantee that target words in passages collected from over 69K novels can only be predicted with long-term context beyond the scope of the sentences containing the target words. Dataset analysis reveals that the types of target words range from common nouns to Chinese 4-character idioms. We also observe that the linguistic relations between target words and long-range context are diverse, including lexical match, synonymy, summary and reasoning. Experimental results show that the Chinese pretrained language model PanGu-α is 45 points behind humans in top-1 word prediction accuracy, indicating that Chinese WPLC is a challenging dataset. The dataset is publicly available at https://git.openi.org.cn/PCL-Platform.Intelligence/Chinese_WPLC.
Anthology ID:
2021.emnlp-main.306
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3770–3778
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2021.emnlp-main.306/
DOI:
10.18653/v1/2021.emnlp-main.306
Cite (ACL):
Huibin Ge, Chenxi Sun, Deyi Xiong, and Qun Liu. 2021. Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3770–3778, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context (Ge et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2021.emnlp-main.306.pdf
Video:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2021.emnlp-main.306.mp4