UMRSpell: Unifying the Detection and Correction Parts of Pre-trained Models towards Chinese Missing, Redundant, and Spelling Correction

Zheyu He, Yujin Zhu, Linlin Wang, Liang Xu


Abstract
Chinese Spelling Correction (CSC) is the task of detecting and correcting misspelled characters in Chinese texts. As an important step for various downstream tasks, CSC confronts two challenges: 1) Character-level errors consist not only of spelling errors but also of missing and redundant characters, which make the input and output texts differ in length. Most CSC methods cannot handle these well because their inherent detection-correction framework requires input and output texts of consistent length. Consequently, these two error types are considered out of scope and left to future work, despite the fact that they are widely found in, and closely bound to, CSC tasks in Chinese industrial scenarios such as Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR). 2) Most existing CSC methods focus on either the detector or the corrector and train a separate model for each, leading to insufficient parameter sharing. To address these issues, we propose a novel model, UMRSpell, which learns the detection and correction parts jointly from a multi-task learning perspective by using a detection transmission self-attention matrix, and flexibly deals with missing, redundant, and spelling errors through re-tagging rules. Furthermore, we build a new dataset, ECMR-2023, containing five kinds of character-level errors to bring the CSC task closer to real-world applications. Experiments on both SIGHAN benchmarks and ECMR-2023 demonstrate the significant effectiveness of UMRSpell over previous representative baselines.
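To illustrate why missing and redundant characters break a fixed-length detection-correction setup, here is a minimal sketch of per-character edit tagging. It is an illustration only: the abstract does not spell out UMRSpell's re-tagging rules, so the tag names (KEEP, REPLACE, DELETE, APPEND) and the function apply_tags are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical tag scheme (an assumption, not the paper's re-tagging rules):
#   ("KEEP", "")     keep the source character unchanged
#   ("REPLACE", c)   spelling error: substitute character c
#   ("DELETE", "")   redundant character: drop it
#   ("APPEND", s)    missing character(s): keep the source character, then insert s
Tag = Tuple[str, str]


def apply_tags(source: str, tags: List[Tag]) -> str:
    """Apply one edit tag per source character and return the corrected text."""
    if len(source) != len(tags):
        raise ValueError("exactly one tag per source character is required")
    out = []
    for ch, (op, arg) in zip(source, tags):
        if op == "KEEP":
            out.append(ch)
        elif op == "REPLACE":
            out.append(arg)
        elif op == "DELETE":
            continue  # output becomes shorter than the input
        elif op == "APPEND":
            out.append(ch + arg)  # output becomes longer than the input
        else:
            raise ValueError(f"unknown tag: {op}")
    return "".join(out)


# Spelling error: same length in and out.
spelling = apply_tags("天汽很好", [("KEEP", ""), ("REPLACE", "气"), ("KEEP", ""), ("KEEP", "")])
# Redundant character: output is one character shorter.
redundant = apply_tags("天气很很好", [("KEEP", ""), ("KEEP", ""), ("KEEP", ""), ("DELETE", ""), ("KEEP", "")])
# Missing character: output is one character longer.
missing = apply_tags("天很好", [("APPEND", "气"), ("KEEP", ""), ("KEEP", "")])
```

All three calls yield the same target sentence, 天气很好, although the inputs have four, five, and three characters respectively. The tags stay aligned one-to-one with the input, which is what lets a sequence tagger cover variable-length corrections.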
Anthology ID:
2023.acl-long.570
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
10238–10250
URL:
https://aclanthology.org/2023.acl-long.570
DOI:
10.18653/v1/2023.acl-long.570
Cite (ACL):
Zheyu He, Yujin Zhu, Linlin Wang, and Liang Xu. 2023. UMRSpell: Unifying the Detection and Correction Parts of Pre-trained Models towards Chinese Missing, Redundant, and Spelling Correction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10238–10250, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
UMRSpell: Unifying the Detection and Correction Parts of Pre-trained Models towards Chinese Missing, Redundant, and Spelling Correction (He et al., ACL 2023)
PDF:
https://preview.aclanthology.org/ingest-acl-2023-videos/2023.acl-long.570.pdf