A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check

Dingmin Wang, Yan Song, Jing Li, Jialong Han, Haisong Zhang


Abstract
Chinese spelling check (CSC) is a challenging yet meaningful task: it serves as a preprocessing step in many natural language processing (NLP) applications and also facilitates the reading and understanding of running text in people's daily lives. However, data-driven approaches to CSC face one major limitation: annotated corpora are too scarce for applying algorithms and building models. In this paper, we propose a novel approach to constructing a CSC corpus with automatically generated spelling errors, which are either visually or phonologically similar characters, produced by OCR- and ASR-based methods, respectively. On the constructed corpus, different models are trained and evaluated for CSC on three standard test sets. Experimental results demonstrate the effectiveness of the corpus and thereby confirm the validity of our approach.
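As a rough illustration of what a corpus with generated spelling errors looks like, the sketch below corrupts a correct sentence by swapping characters for visually or phonologically similar ones and records the error positions. It is not the method of the paper: the abstract states that errors come from OCR- and ASR-based methods, whereas here two small hand-written confusion tables (`VISUAL`, `PHONETIC`) stand in for them. The function name `corrupt` and the `rate` parameter are likewise assumptions made for this example.

```python
import random

# Toy confusion tables, hand-written for illustration only (hypothetical data;
# the paper obtains its errors from OCR- and ASR-based methods instead).
VISUAL = {"己": ["已"], "未": ["末"], "日": ["曰"]}
PHONETIC = {"在": ["再"], "的": ["得", "地"], "做": ["作"]}


def corrupt(sentence, rng, rate=0.3):
    """Return (noisy_sentence, error_positions) for a correct sentence.

    Each character that has at least one confusable candidate is replaced
    with probability `rate`; all other characters are left untouched.
    """
    chars = list(sentence)
    positions = []
    for i, ch in enumerate(chars):
        candidates = VISUAL.get(ch, []) + PHONETIC.get(ch, [])
        if candidates and rng.random() < rate:
            chars[i] = rng.choice(candidates)
            positions.append(i)
    return "".join(chars), positions


# Seeded generator so the example is reproducible.
rng = random.Random(0)
original = "我自己在做未完成的事"
noisy, positions = corrupt(original, rng, rate=1.0)
print(original)
print(noisy, positions)
```

Each (noisy sentence, error positions, original sentence) triple can then serve as one annotated training example, which is the kind of data the abstract says is otherwise scarce.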
Anthology ID:
D18-1273
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2517–2527
URL:
https://aclanthology.org/D18-1273
DOI:
10.18653/v1/D18-1273
Cite (ACL):
Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (Wang et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/D18-1273.pdf
Code
 wdimmy/Automatic-Corpus-Generation