Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels

Yuning Ding, Andrea Horbach, Torsten Zesch


Abstract
In this paper, we analyse the challenges of Chinese content scoring in comparison to English. Since a review of prior work on Chinese content scoring reveals a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers to five science-related questions. As a second data set, we collected ASAP-ZH, with 942 answers, by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system to Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels, such as character n-grams, tend to perform better than token-level features.
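To illustrate the distinction the abstract draws between segmentation levels, the following is a minimal sketch of character n-gram versus token n-gram extraction. It is not the authors' system: the function names, the example answer, and the hand-made token segmentation are all assumptions made for illustration. In practice the token list would come from a Chinese word segmenter, whose choice and quality affect the token-level features.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count contiguous character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))

def token_ngrams(tokens, n):
    """Count contiguous token n-grams from an already segmented answer."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical student answer ("plants need sunlight"), not taken from CESA or ASAP-ZH.
answer = "植物需要阳光"
# Hand-made segmentation, standing in for the output of a word segmenter.
tokens = ["植物", "需要", "阳光"]

char_bigrams = char_ngrams(answer, 2)     # needs no segmentation step
token_unigrams = token_ngrams(tokens, 1)  # depends on the segmenter's output
```

Character n-grams are computed directly from the raw string, so they are unaffected by segmentation errors, whereas token-level features inherit whatever boundaries the segmenter produces.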
Anthology ID:
2020.aacl-main.37
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Editors:
Kam-Fai Wong, Kevin Knight, Hua Wu
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
347–357
URL:
https://aclanthology.org/2020.aacl-main.37
Cite (ACL):
Yuning Ding, Andrea Horbach, and Torsten Zesch. 2020. Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 347–357, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels (Ding et al., AACL 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2020.aacl-main.37.pdf
Dataset:
 2020.aacl-main.37.Dataset.zip
Software:
 2020.aacl-main.37.Software.zip