Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani; Daisy Rosenblum; Antonios Anastasopoulos; Graham Neubig

doi:10.1162/tacl_a_00427

Abstract

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
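The self-training loop mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the neural post-correction model is replaced here by a toy word-substitution table, the names `self_train` and `train_word_map` are hypothetical, and the lexically aware WFSA decoding step is left out entirely. The sketch only shows the general pattern of training on labeled pairs, labeling raw OCR output with the current model, and retraining on the union.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]              # (raw OCR output, corrected text)
Model = Callable[[str], str]        # maps OCR output to a corrected string


def train_word_map(pairs: List[Pair]) -> Model:
    """Toy stand-in for a post-correction model (an assumption, not the
    paper's neural model): learn the most frequent word-level correction."""
    counts: Dict[str, Counter] = {}
    for src, tgt in pairs:
        src_words, tgt_words = src.split(), tgt.split()
        if len(src_words) != len(tgt_words):
            continue  # skip pairs that cannot be aligned word by word
        for s, t in zip(src_words, tgt_words):
            counts.setdefault(s, Counter())[t] += 1
    table = {s: c.most_common(1)[0][0] for s, c in counts.items()}

    def correct(text: str) -> str:
        # Unknown words are passed through unchanged.
        return " ".join(table.get(w, w) for w in text.split())

    return correct


def self_train(train_fn: Callable[[List[Pair]], Model],
               labeled: List[Pair],
               unlabeled: List[str],
               rounds: int = 2) -> Model:
    """Generic self-training: retrain on labeled data plus the model's
    own predictions for the unlabeled inputs."""
    model = train_fn(labeled)
    for _ in range(rounds):
        pseudo = [(src, model(src)) for src in unlabeled]
        model = train_fn(labeled + pseudo)
    return model


labeled_pairs = [("tbe cat", "the cat")]
raw_ocr = ["tbe dog", "a dog"]
model = self_train(train_word_map, labeled_pairs, raw_ocr)
corrected = model("tbe dog")
```

In this toy setting the pseudo-labeled pairs only reinforce corrections the model already makes; the abstract reports that, for the actual method, combining self-training with lexically aware decoding was essential for consistent improvements.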

Anthology ID: 2021.tacl-1.76
Volume: Transactions of the Association for Computational Linguistics, Volume 9
Year: 2021
Address: Cambridge, MA
Editors: Brian Roark, Ani Nenkova
Venue: TACL
Publisher: MIT Press
Pages: 1285–1302
URL: https://preview.aclanthology.org/build-pipeline-with-new-library/2021.tacl-1.76/
DOI: 10.1162/tacl_a_00427
Cite (ACL): Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, and Graham Neubig. 2021. Lexically Aware Semi-Supervised Learning for OCR Post-Correction. Transactions of the Association for Computational Linguistics, 9:1285–1302.
Cite (Informal): Lexically Aware Semi-Supervised Learning for OCR Post-Correction (Rijhwani et al., TACL 2021)
PDF: https://preview.aclanthology.org/build-pipeline-with-new-library/2021.tacl-1.76.pdf
Video: https://preview.aclanthology.org/build-pipeline-with-new-library/2021.tacl-1.76.mp4
