Abstract
A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
- Anthology ID:
- R19-1051
- Volume:
- Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
- Month:
- September
- Year:
- 2019
- Address:
- Varna, Bulgaria
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 431–436
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/R19-1051/
- DOI:
- 10.26615/978-954-452-056-4_051
- Cite (ACL):
- Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436, Varna, Bulgaria. INCOMA Ltd.
- Cite (Informal):
- From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction (Hämäläinen & Hengchen, RANLP 2019)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/R19-1051.pdf
- Code:
- mikahama/natas