ChID: A Large-scale Chinese IDiom Dataset for Cloze Test

Chujie Zheng, Minlie Huang, Aixin Sun


Abstract
Cloze-style reading comprehension in Chinese is still limited by the lack of diverse corpora. In this paper, we propose ChID, a large-scale Chinese cloze test dataset that studies the comprehension of idioms, a unique language phenomenon in Chinese. In this corpus, the idioms in a passage are replaced by blank symbols, and the correct answer must be chosen from well-designed candidate idioms. We carefully study how the design of candidate idioms and the representation of idioms affect the performance of state-of-the-art models. Results show that machine accuracy is substantially worse than human accuracy, leaving considerable room for further research.
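The task format described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual schema or the authors' code: the `ClozeItem` type, the `BLANK` marker, the helper functions, and the example passage are all hypothetical and invented here for illustration.

```python
from dataclasses import dataclass

# Hypothetical blank symbol; the marker used in the released data may differ.
BLANK = "#idiom#"


@dataclass
class ClozeItem:
    passage: str           # text with one idiom replaced by BLANK
    candidates: list[str]  # candidate idioms, including the correct one
    answer: str            # the idiom originally in the passage


def fill_blank(item: ClozeItem, choice: str) -> str:
    """Return the passage with the first blank replaced by the chosen idiom."""
    if choice not in item.candidates:
        raise ValueError(f"{choice!r} is not one of the candidates")
    return item.passage.replace(BLANK, choice, 1)


def accuracy(items: list[ClozeItem], predictions: list[str]) -> float:
    """Fraction of items whose predicted idiom matches the answer."""
    correct = sum(pred == it.answer for it, pred in zip(items, predictions))
    return correct / len(items)


# Invented example passage (not taken from ChID).
item = ClozeItem(
    passage="他做事总是" + BLANK + "，从不半途而废。",
    candidates=["持之以恒", "画蛇添足", "守株待兔"],
    answer="持之以恒",
)

print(fill_blank(item, item.answer))
print(accuracy([item], ["画蛇添足"]))
```

A model for this task would score each candidate in context and pick the highest-scoring one; the accuracy helper above is the kind of metric the human-versus-machine comparison in the abstract refers to.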
Anthology ID:
P19-1075
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
778–787
URL:
https://aclanthology.org/P19-1075
DOI:
10.18653/v1/P19-1075
Cite (ACL):
Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 778–787, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
ChID: A Large-scale Chinese IDiom Dataset for Cloze Test (Zheng et al., ACL 2019)
PDF:
https://preview.aclanthology.org/update-css-js/P19-1075.pdf
Code:
zhengcj1/ChID-Dataset
Data:
ChID, CBT, CLOTH, CMRC 2017, LAMBADA, Who-did-What