OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

Jenna Kanerva, Cassandra Ledins, Siiri Käpyaho, Filip Ginter


Abstract
Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates open-weight LLMs for OCR error correction on historical English and Finnish datasets. We explore several strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results show that while modern LLMs are promising for reducing the character error rate (CER) in English, they did not reach practically useful performance for Finnish. Our findings highlight both the potential and the limitations of LLMs in scaling OCR post-correction to large historical corpora.
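The abstract reports results as character error rate (CER) but does not give the exact formula used in the paper. As a minimal sketch, assuming the common definition (character-level edit distance between the system output and the reference, divided by the reference length), CER could be computed as follows. The function names `levenshtein` and `cer` are illustrative and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate (assumed definition): edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)
```

Under this definition, post-correction helps on a segment when the CER of the corrected text against the reference is lower than the CER of the raw OCR output against the same reference.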
Anthology ID:
2025.resourceful-1.8
Volume:
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Špela Arhar Holdt, Nikolai Ilinykh, Barbara Scalvini, Micaella Bruton, Iben Nyholm Debess, Crina Madalina Tudor
Venues:
RESOURCEFUL | WS
Publisher:
University of Tartu Library, Estonia
Pages:
38–47
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.resourceful-1.8/
Cite (ACL):
Jenna Kanerva, Cassandra Ledins, Siiri Käpyaho, and Filip Ginter. 2025. OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025), pages 38–47, Tallinn, Estonia. University of Tartu Library, Estonia.
Cite (Informal):
OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches (Kanerva et al., RESOURCEFUL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.resourceful-1.8.pdf