From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Mika Hämäläinen; Simon Hengchen

doi:10.26615/978-954-452-056-4_051

Abstract

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is time-consuming, and a large share of the automatic approaches have relied on rules or supervised machine learning. We present a fully automatic, unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.

Anthology ID: R19-1051
Volume: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month: September
Year: 2019
Address: Varna, Bulgaria
Editors: Ruslan Mitkov, Galia Angelova
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 431–436
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/R19-1051/
DOI: 10.26615/978-954-452-056-4_051
Cite (ACL): Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal): From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction (Hämäläinen & Hengchen, RANLP 2019)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/R19-1051.pdf
Code: mikahama/natas
