Abstract
Optical character recognition (OCR) from newspaper page images is susceptible to noise due to degradation of old documents and variation in typesetting. In this report, we present a novel approach to OCR post-correction. We cast error correction as a translation task, and fine-tune BART, a transformer-based sequence-to-sequence language model pretrained to denoise corrupted text. We are the first to use sentence-level transformer models for OCR post-correction, and our best model achieves a 29.4% improvement in character accuracy over the original noisy OCR text. Our results demonstrate the utility of pretrained language models for dealing with noisy text.

- Anthology ID: 2021.wnut-1.31
- Volume: Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
- Month: November
- Year: 2021
- Address: Online
- Venue: WNUT
- Publisher: Association for Computational Linguistics
- Pages: 284–290
- URL: https://aclanthology.org/2021.wnut-1.31
- DOI: 10.18653/v1/2021.wnut-1.31
- Cite (ACL): Elizabeth Soper, Stanley Fujimoto, and Yen-Yun Yu. 2021. BART for Post-Correction of OCR Newspaper Text. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 284–290, Online. Association for Computational Linguistics.
- Cite (Informal): BART for Post-Correction of OCR Newspaper Text (Soper et al., WNUT 2021)
- PDF: https://preview.aclanthology.org/remove-xml-comments/2021.wnut-1.31.pdf
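As a hedged illustration of the evaluation behind the abstract's "29.4% improvement in character accuracy" claim, the sketch below computes character accuracy as one minus the character error rate (Levenshtein distance over reference length). This exact formula and the sample strings are assumptions for illustration, not the authors' evaluation code.

```python
# Sketch: character accuracy as 1 - CER (Levenshtein distance / reference
# length). The formula is an assumption; the paper's metric may differ.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(hypothesis: str, reference: str) -> float:
    """Fraction of reference characters correctly recovered (1 - CER)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - levenshtein(hypothesis, reference) / len(reference))

# Illustrative strings, not data from the paper.
noisy = "Tbe qu1ck brown f0x"      # raw OCR output
corrected = "The quick brown fox"  # post-corrected text
reference = "The quick brown fox"  # ground truth

print(char_accuracy(noisy, reference))      # below 1.0 (3 of 19 chars wrong)
print(char_accuracy(corrected, reference))  # 1.0
```

A relative improvement such as the reported 29.4% would then be computed by comparing the accuracy of the raw OCR text and the post-corrected text against the same ground-truth transcriptions.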