CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers

Baoxin Wang, Xingyi Duan, Dayong Wu, Wanxiang Che, Zhigang Chen, Guoping Hu


Abstract
Chinese text correction (CTC) focuses on detecting and correcting Chinese spelling and grammatical errors. Most existing datasets for Chinese spelling check (CSC) and Chinese grammatical error correction (GEC) consist of single sentences written by Chinese-as-a-second-language (CSL) learners. We find that errors made by native speakers differ significantly from those produced by non-native speakers. These differences make it inappropriate to use the existing test sets directly to evaluate text correction systems for native speakers. Some errors also require cross-sentence information to be identified and corrected. In this paper, we propose a cross-sentence Chinese text correction dataset for native speakers. Concretely, we manually annotated 1,500 texts written by native speakers. The dataset consists of 30,811 sentences and more than 1,000,000 Chinese characters. It contains four types of errors: spelling errors, redundant words, missing words, and word ordering errors. We also test several state-of-the-art models on the dataset. The experimental results show that even the best-performing model scores 20 points lower than humans, which indicates that there is still much room for improvement. We hope that the new dataset can fill the gap in cross-sentence text correction for native Chinese speakers.
Anthology ID:
2022.coling-1.294
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3331–3341
URL:
https://aclanthology.org/2022.coling-1.294
Cite (ACL):
Baoxin Wang, Xingyi Duan, Dayong Wu, Wanxiang Che, Zhigang Chen, and Guoping Hu. 2022. CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3331–3341, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers (Wang et al., COLING 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.coling-1.294.pdf
Code:
destwang/ctcresources
Data:
JFLEG, MuCGEC