@inproceedings{ge-etal-2021-chinese,
    title = "{C}hinese {WPLC}: A {C}hinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context",
    author = "Ge, Huibin  and
      Sun, Chenxi  and
      Xiong, Deyi  and
      Liu, Qun",
    editor = "Moens, Marie-Francine  and
      Huang, Xuanjing  and
      Specia, Lucia  and
      Yih, Scott Wen-tau",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2021.emnlp-main.306/",
    doi = "10.18653/v1/2021.emnlp-main.306",
    pages = "3770--3778",
    abstract = "This paper presents a Chinese dataset for evaluating pretrained language models on Word Prediction given Long-term Context (Chinese WPLC). We propose both automatic and manual selection strategies tailored to Chinese to guarantee that target words in passages collected from over 69K novels can only be predicted with long-term context beyond the scope of sentences containing the target words. Dataset analysis reveals that the types of target words range from common nouns to Chinese 4-character idioms. We also observe that linguistic relations between target words and long-range context exhibit diversity, including lexical match, synonym, summary and reasoning. Experiment results show that the Chinese pretrained language model PanGu-$\alpha$ is 45 points behind human in terms of top-1 word prediction accuracy, indicating that Chinese WPLC is a challenging dataset. The dataset is publicly available at \url{https://git.openi.org.cn/PCL-Platform.Intelligence/Chinese_WPLC}."
}