SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check

Tuo Ji, Hang Yan, Xipeng Qiu


Abstract
Chinese Spelling Check (CSC) is the task of detecting and correcting Chinese spelling errors. Many models use a predefined confusion set to learn a mapping between correct characters and their visually or phonetically similar misuses, but this mapping may be out-of-domain. To address this, we propose SpellBERT, a pretrained model that uses graph-based extra features and does not depend on a confusion set. To explicitly capture these two error patterns, we employ a graph neural network to introduce radical and pinyin information as visual and phonetic features. To better fuse these features with character representations, we devise pre-training tasks modeled on masked language modeling. With this feature-rich pre-training, SpellBERT, at only half the size of BERT, achieves competitive performance and a state-of-the-art result on the OCR dataset, where most of the errors are not covered by the existing confusion set.
Anthology ID:
2021.emnlp-main.287
Original:
2021.emnlp-main.287v1
Version 2:
2021.emnlp-main.287v2
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3544–3551
URL:
https://aclanthology.org/2021.emnlp-main.287
DOI:
10.18653/v1/2021.emnlp-main.287
Cite (ACL):
Tuo Ji, Hang Yan, and Xipeng Qiu. 2021. SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3544–3551, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check (Ji et al., EMNLP 2021)
PDF:
https://preview.aclanthology.org/landing_page/2021.emnlp-main.287.pdf
Code:
benbijituo/spellbert