From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines

Haoze Guo; Ziqi Wei

Abstract

Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance should be treated as a first-class analytical layer in NLP for digital humanities, supporting reproducibility, source criticism, and uncertainty-aware interpretation.
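
The span-level record described in the abstract (edit type, correction source, confidence, revision status) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Correction` class, its field names, the `apply_corrections` helper, and the example values are all hypothetical, and the sketch assumes corrections are non-overlapping character spans over the raw OCR string.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Correction:
    """One span-level correction with its provenance (hypothetical schema)."""
    start: int          # start offset in the raw OCR text (inclusive)
    end: int            # end offset in the raw OCR text (exclusive)
    replacement: str    # corrected text for the span
    edit_type: str      # e.g. "substitution", "insertion", "deletion"
    source: str         # e.g. "model", "human"
    confidence: float   # score in [0, 1] attached by the correction source
    status: str         # e.g. "accepted", "pending", "rejected"


def apply_corrections(raw: str, corrections: List[Correction],
                      min_confidence: float = 0.0) -> str:
    """Apply only corrections at or above min_confidence.

    The raw text is never overwritten; each call derives a new view of it.
    Assumes spans do not overlap.
    """
    selected = [c for c in corrections if c.confidence >= min_confidence]
    out = raw
    # Apply right to left so earlier offsets stay valid after each edit.
    for c in sorted(selected, key=lambda c: c.start, reverse=True):
        out = out[:c.start] + c.replacement + out[c.end:]
    return out


# Illustrative data only, not drawn from the paper's corpus.
raw = "Tbe Duke of Wellingtou"
corrections = [
    Correction(0, 3, "The", "substitution", "model", 0.95, "accepted"),
    Correction(12, 22, "Wellington", "substitution", "model", 0.40, "pending"),
]

fully_corrected = apply_corrections(raw, corrections)
provenance_filtered = apply_corrections(raw, corrections, min_confidence=0.8)
```

Because the raw text and the correction records are kept separately, the three conditions compared in the abstract (raw OCR, fully corrected text, provenance-filtered corrections) become three views of the same data, and a low-confidence span such as the second one above can be flagged for human review instead of being silently applied.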

Anthology ID: 2026.nlp4dh-1.1
Volume: Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Month: July
Year: 2026
Address: San Diego, USA
Editors: Sil Hamilton, Emily Öhman, Rebecca M. M. Hicke, Yuri Bizzoni, Axel Bax, Jacob A. Matthews, Mika Hämäläinen
Venues: NLP4DH | WS
Publisher: Association for Computational Linguistics
Pages: 1–12
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.nlp4dh-1.1/
Cite (ACL): Haoze Guo and Ziqi Wei. 2026. From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines. In Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities, pages 1–12, San Diego, USA. Association for Computational Linguistics.
Cite (Informal): From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines (Guo & Wei, NLP4DH 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.nlp4dh-1.1.pdf
