Abstract
State-of-the-art approaches to the spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference; and sequence labeling models built on Transformer encoders such as BERT, which operate over a token-level label space and therefore require a large pre-defined vocabulary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a character-level pre-trained language model as the text encoder, and then predict character-level edits that transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach that alleviates the long-tail label distribution issue without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is accurate and much faster than many existing models.
- Anthology ID:
- 2021.wnut-1.13
- Volume:
- Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
- Month:
- November
- Year:
- 2021
- Address:
- Online
- Venue:
- WNUT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 106–113
- URL:
- https://aclanthology.org/2021.wnut-1.13
- DOI:
- 10.18653/v1/2021.wnut-1.13
- Cite (ACL):
- Mengyi Gao, Canran Xu, and Peng Shi. 2021. Hierarchical Character Tagger for Short Text Spelling Error Correction. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 106–113, Online. Association for Computational Linguistics.
- Cite (Informal):
- Hierarchical Character Tagger for Short Text Spelling Error Correction (Gao et al., WNUT 2021)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2021.wnut-1.13.pdf
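The character-level edit tagging idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual label scheme or model; the tag set here (`K` keep, `D` delete, `R_c` replace with character `c`, `A_c` keep then append `c`) is an assumption chosen purely to show how per-character edit labels can reconstruct a corrected string from a misspelled one:

```python
def apply_char_tags(text: str, tags: list[str]) -> str:
    """Apply one edit tag per input character (hypothetical tag set,
    not HCTagger's): "K" keep, "D" delete, "R_c" replace with c,
    "A_c" keep the character and append c after it."""
    out = []
    for ch, tag in zip(text, tags):
        if tag == "K":
            out.append(ch)            # copy character unchanged
        elif tag == "D":
            continue                  # drop character
        elif tag.startswith("R_"):
            out.append(tag[2:])       # substitute character
        elif tag.startswith("A_"):
            out.append(ch)            # keep, then insert after
            out.append(tag[2:])
    return "".join(out)

# "helllo" -> "hello": delete one duplicated 'l'
print(apply_char_tags("helllo", ["K", "K", "K", "D", "K", "K"]))  # hello
# "teh" -> "the": transpose via two replacements
print(apply_char_tags("teh", ["K", "R_h", "R_e"]))  # the
```

Because each label edits a single character, the label space stays small (a handful of operations crossed with the character vocabulary) compared with token-level schemes that need a full word vocabulary.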