Generating Errors: OCR Post-Processing for Icelandic
Atli Jasonarson, Steinþór Steingrímsson, Einar Sigurðsson, Árni Magnússon, Finnur Ingimundarson
Abstract
We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
- Anthology ID:
- 2023.nodalida-1.29
- Volume:
- Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- May
- Year:
- 2023
- Address:
- Tórshavn, Faroe Islands
- Editors:
- Tanel Alumäe, Mark Fishel
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 286–291
- URL:
- https://aclanthology.org/2023.nodalida-1.29
- Cite (ACL):
- Atli Jasonarson, Steinþór Steingrímsson, Einar Sigurðsson, Árni Magnússon, and Finnur Ingimundarson. 2023. Generating Errors: OCR Post-Processing for Icelandic. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 286–291, Tórshavn, Faroe Islands. University of Tartu Library.
- Cite (Informal):
- Generating Errors: OCR Post-Processing for Icelandic (Jasonarson et al., NoDaLiDa 2023)
- PDF:
- https://preview.aclanthology.org/fix-dup-bibkey/2023.nodalida-1.29.pdf