Correcting Whitespace Errors in Digitized Historical Texts
Abstract
Whitespace errors are common in digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized collection of newspapers from the 19th-century United States. Our technique identifies and corrects most whitespace errors while introducing minimal oversegmentation: it achieves 77% recall at a false positive rate of less than 1%, and 91% recall at a false positive rate of less than 3%.
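To make the abstract's scoring rule concrete, here is a minimal Python sketch of how such a likelihood ratio test could be computed from raw n-gram counts. The count tables, interpolation weight, floor value, and function names are all illustrative assumptions, not the authors' released implementation (see sandeepsoni/whitespace-normalizer for that).

```python
import math
from collections import Counter

# Illustrative count tables; a real system would load these from the
# Google n-grams counts described in the paper. Everything below
# (names, weights, floor) is an assumed sketch, not the released code.
unigrams: Counter = Counter()
bigrams: Counter = Counter()
trigrams: Counter = Counter()

LAMBDA = 0.7   # assumed interpolation weight for the trigram model
EPS = 1e-12    # floor that keeps the ratio finite for unseen n-grams

def interp_prob(w, ctx1, ctx2):
    """P(w | ctx1, ctx2) as a mixture of trigram and bigram
    relative frequencies, floored at EPS."""
    tri = trigrams[(ctx1, ctx2, w)] / bigrams[(ctx1, ctx2)] if bigrams[(ctx1, ctx2)] else 0.0
    bi = bigrams[(ctx2, w)] / unigrams[ctx2] if unigrams[ctx2] else 0.0
    return max(LAMBDA * tri + (1.0 - LAMBDA) * bi, EPS)

def merge_log_ratio(l1, l2, a, b, r):
    """Log-likelihood ratio of reading '... l1 l2 ab r ...' (merged)
    against '... l1 l2 a b r ...' (split). Positive values suggest
    that 'a b' is a single token broken by a spurious space."""
    merged = a + b
    log_merged = (math.log(interp_prob(merged, l1, l2))
                  + math.log(interp_prob(r, l2, merged)))
    log_split = (math.log(interp_prob(a, l1, l2))
                 + math.log(interp_prob(b, l2, a))
                 + math.log(interp_prob(r, a, b)))
    return log_merged - log_split

if __name__ == "__main__":
    # Toy counts from a tiny sample, just to exercise the test.
    text = "the said territory of the said state".split()
    unigrams.update(text)
    bigrams.update(zip(text, text[1:]))
    trigrams.update(zip(text, text[1:], text[2:]))
    # Large positive value: 'sa id' should be merged into 'said'.
    print(merge_log_ratio("of", "the", "sa", "id", "state"))
```

A corrector would slide this test over adjacent token pairs and join them whenever the ratio clears a threshold tuned for the desired false positive rate; a symmetric test over each internal split point of a token would handle run-together words. The recall/false-positive trade-off reported in the abstract corresponds to different settings of that threshold.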
- Anthology ID: W19-2513
- Volume: Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
- Month: June
- Year: 2019
- Address: Minneapolis, USA
- Editors: Beatrice Alex, Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, Stan Szpakowicz
- Venue: LaTeCH
- SIG: SIGHUM
- Publisher: Association for Computational Linguistics
- Pages: 98–103
- URL: https://aclanthology.org/W19-2513
- DOI: 10.18653/v1/W19-2513
- Cite (ACL): Sandeep Soni, Lauren Klein, and Jacob Eisenstein. 2019. Correcting Whitespace Errors in Digitized Historical Texts. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 98–103, Minneapolis, USA. Association for Computational Linguistics.
- Cite (Informal): Correcting Whitespace Errors in Digitized Historical Texts (Soni et al., LaTeCH 2019)
- PDF: https://preview.aclanthology.org/teach-a-man-to-fish/W19-2513.pdf
- Code: sandeepsoni/whitespace-normalizer