Fast Whitespace Correction with Encoder-Only Transformers

Hannah Bast, Matthias Hertel, Sebastian Walter


Abstract
The goal of whitespace correction is to fix space errors in arbitrary given text. For example, given the text “whi te space correctio nwithTransf or mers”, produce “whitespace correction with Transformers”. We compare two Transformer-based models, a character-level encoder-decoder model and a byte-level encoder-only model. We find that the encoder-only model is both faster and achieves higher quality. We provide an easy-to-use tool that is over 900 times faster than the previous best tool, with the same high quality. Our tool repairs text at a rate of over 200 kB/s on GPU, with a sequence-averaged F1-score ranging from 87.5% for hard-to-correct text up to 99% for text without any spaces.
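To make the task concrete, below is a minimal sketch of whitespace correction framed as per-character classification, in the spirit of the encoder-only approach the abstract describes. The label scheme ("K"/"I"/"D") and the hand-written labels standing in for model predictions are illustrative assumptions; the paper's byte-level Transformer is not reproduced here.

# Illustrative sketch: whitespace correction as per-character classification.
# The label scheme and the hand-written labels below are assumptions for
# illustration only; the paper's byte-level encoder-only model, which would
# predict these labels, is not reproduced here.

from typing import List

def apply_labels(text: str, labels: List[str]) -> str:
    """Rebuild corrected text from one label per input character:
    "K" = keep the character as-is,
    "I" = insert a space before the character,
    "D" = delete the character (used for spurious spaces)."""
    out = []
    for ch, lab in zip(text, labels):
        if lab == "D":
            continue          # drop a spurious space
        if lab == "I":
            out.append(" ")   # restore a missing space
        out.append(ch)
    return "".join(out)

# Hand-written labels stand in for model output:
# "whi te space" -> delete the stray space after "whi".
text = "whi te space"
labels = ["K", "K", "K", "D", "K", "K", "K", "K", "K", "K", "K", "K"]
assert apply_labels(text, labels) == "white space"

Because every character receives an independent label, an entire sequence can be repaired in a single encoder pass, which is what makes the encoder-only model so much faster than autoregressive decoding with an encoder-decoder.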
Anthology ID: 2023.acl-demo.37
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Danushka Bollegala, Ruihong Huang, Alan Ritter
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 389–399
URL: https://aclanthology.org/2023.acl-demo.37
DOI: 10.18653/v1/2023.acl-demo.37
Cite (ACL): Hannah Bast, Matthias Hertel, and Sebastian Walter. 2023. Fast Whitespace Correction with Encoder-Only Transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 389–399, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Fast Whitespace Correction with Encoder-Only Transformers (Bast et al., ACL 2023)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2023.acl-demo.37.pdf
Video: https://preview.aclanthology.org/emnlp-22-attachments/2023.acl-demo.37.mp4