Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents

Samuel Backer; Louis Hyman

Abstract

New LLM-based OCR and post-OCR correction methods promise to transform computational historical research, yet their efficacy remains contested. We compare multiple correction approaches, including methods for “bootstrapping” fine-tuning with LLM-generated data, and measure their effect on downstream tasks. Our results suggest that standard OCR metrics often underestimate performance gains for historical research, underscoring the need for discipline-driven evaluations that can better reflect the needs of computational humanists.

Anthology ID: 2025.nlp4dh-1.21
Volume: Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Month: May
Year: 2025
Address: Albuquerque, USA
Editors: Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar
Venues: NLP4DH | WS
Publisher: Association for Computational Linguistics
Pages: 251–256
URL: https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.21/
Cite (ACL): Samuel Backer and Louis Hyman. 2025. Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, pages 251–256, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal): Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents (Backer & Hyman, NLP4DH 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.21.pdf
