Abstract
We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods.
- Anthology ID:
- P18-1220
- Volume:
- Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2018
- Address:
- Melbourne, Australia
- Editors:
- Iryna Gurevych, Yusuke Miyao
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2363–2372
- URL:
- https://preview.aclanthology.org/remove-affiliations/P18-1220/
- DOI:
- 10.18653/v1/P18-1220
- Cite (ACL):
- Rui Dong and David Smith. 2018. Multi-Input Attention for Unsupervised OCR Correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2363–2372, Melbourne, Australia. Association for Computational Linguistics.
- Cite (Informal):
- Multi-Input Attention for Unsupervised OCR Correction (Dong & Smith, ACL 2018)
- PDF:
- https://preview.aclanthology.org/remove-affiliations/P18-1220.pdf
- Data
- New York Times Annotated Corpus
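
The abstract mentions a decoder with multi-input attention averaging that searches for consensus among several noisy copies of the same text. The following is a minimal sketch of that idea, not the authors' implementation. It assumes dot-product attention scoring and a uniform average over the per-input context vectors; the function names `attend` and `multi_input_context` and the toy vectors are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention over ONE input sequence (assumed scoring function).

    Returns the context vector: the attention-weighted sum of encoder states.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    return [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]


def multi_input_context(decoder_state, witnesses):
    """Attend to each noisy copy separately, then average the contexts.

    `witnesses` is a list of encoded input sequences, one per OCR copy.
    The uniform average is an assumption of this sketch.
    """
    contexts = [attend(decoder_state, enc) for enc in witnesses]
    dim = len(decoder_state)
    return [sum(c[i] for c in contexts) / len(contexts) for i in range(dim)]


# Toy example: two copies of the same line, encoded as 2-dimensional states.
copy_a = [[1.0, 0.0], [3.0, 0.0]]
copy_b = [[0.0, 4.0]]
context = multi_input_context([0.0, 0.0], [copy_a, copy_b])
```

In a full decoder, a combined context like this would be computed at every output step and fed to the next-character prediction, so that each copy contributes evidence even when the copies differ in length.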