Árni Magnússon
2023
Generating Errors: OCR Post-Processing for Icelandic
Atli Jasonarson | Steinþór Steingrímsson | Einar Sigurðsson | Árni Magnússon | Finnur Ingimundarson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.