Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre
Florian Debaene, Aaron Maladry, Els Lefever, Veronique Hoste
Abstract
This paper explores the effectiveness of two types of transformer models (large generative models and sequence-to-sequence models) for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a qualitative OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for all alignment pairs. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models in this task, correcting more OCR errors and overgenerating and undergenerating less, with mBART as the best performing system.
- Anthology ID:
- 2025.coling-main.690
- Original:
- 2025.coling-main.690v1
- Version 2:
- 2025.coling-main.690v2
- Volume:
- Proceedings of the 31st International Conference on Computational Linguistics
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10367–10374
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.690/
- Cite (ACL):
- Florian Debaene, Aaron Maladry, Els Lefever, and Veronique Hoste. 2025. Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10367–10374, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre (Debaene et al., COLING 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.690.pdf
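
The abstract describes keeping, for each alignment pair, the candidate alignment with the lowest Character Error Rate (CER). The sketch below illustrates that selection step only. It is not the authors' code: it assumes CER is the Levenshtein edit distance divided by the length of the gold text, and the function names, the `candidates` dictionary, and the method labels `"char"` and `"semantic"` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ocr: str, gold: str) -> float:
    """Assumed CER definition: edit distance normalised by gold length."""
    if not gold:
        return 0.0 if not ocr else float("inf")
    return levenshtein(ocr, gold) / len(gold)


def best_alignment(ocr_line: str, candidates: dict[str, str]) -> tuple[str, str]:
    """Return (method, gold_line) for the candidate with the lowest CER.

    `candidates` maps a hypothetical alignment-method label to the gold
    line that method paired with `ocr_line`.
    """
    method = min(candidates, key=lambda m: cer(ocr_line, candidates[m]))
    return method, candidates[method]


# Illustrative, made-up lines: two alignment methods propose different gold lines.
ocr_line = "Ick heb u lief"
candidates = {
    "char": "Ick heb u lief.",
    "semantic": "Hy gaet heen",
}
chosen_method, chosen_gold = best_alignment(ocr_line, candidates)
```

Normalising by the gold length is one common convention; other definitions of CER would change the scores but not the selection logic.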