Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents

Samuel Backer, Louis Hyman


Abstract
New LLM-based OCR and post-OCR correction methods promise to transform computational historical research, yet their efficacy remains contested. We compare multiple correction approaches, including methods for “bootstrapping” fine-tuning with LLM-generated data, and measure their effect on downstream tasks. Our results suggest that standard OCR metrics often underestimate performance gains for historical research, underscoring the need for discipline-driven evaluations that can better reflect the needs of computational humanists.
Anthology ID:
2025.nlp4dh-1.21
Volume:
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Month:
May
Year:
2025
Address:
Albuquerque, USA
Editors:
Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
Pages:
251–256
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.21/
Cite (ACL):
Samuel Backer and Louis Hyman. 2025. Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, pages 251–256, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents (Backer & Hyman, NLP4DH 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.21.pdf