Abstract
In this paper, we analyse the challenges of Chinese content scoring in comparison to English. As a review of prior work for Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers for five science-related questions. As a second data set, we collected ASAP-ZH with 942 answers by re-using three existing prompts from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels such as character n-grams tend to have better performance than features on token level.

- Anthology ID:
- 2020.aacl-main.37
- Volume:
- Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
- Month:
- December
- Year:
- 2020
- Address:
- Suzhou, China
- Editors:
- Kam-Fai Wong, Kevin Knight, Hua Wu
- Venue:
- AACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 347–357
- URL:
- https://aclanthology.org/2020.aacl-main.37
- Cite (ACL):
- Yuning Ding, Andrea Horbach, and Torsten Zesch. 2020. Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 347–357, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Chinese Content Scoring: Open-Access Datasets and Features on Different Segmentation Levels (Ding et al., AACL 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2020.aacl-main.37.pdf