Human-in-the-Loop Mass Transcription and Ground Truth Annotation for Challenging Historical Documents

Norbert Fischer, Frank Puppe


Abstract
Challenging historical documents still pose significant difficulties for fully automatic layout detection and text recognition, so their output requires lengthy and demanding manual correction. We describe our experiences with complex layouts and present our workflow with AdaptOCR, a web-based annotation tool designed to facilitate the efficient transcription and ground-truth annotation of demanding historical documents. Addressing the limitations of existing solutions, AdaptOCR prioritizes a streamlined workflow with an integrated, trainable layout and OCR pipeline. The tool uses the PAGE standard to represent document structure and supports the annotation of baselines, regions, and text lines as well as the correction of their transcriptions, with automatic OCR invocation and dictionary-based error detection. Furthermore, it supports flexible annotations with custom element types and attributes to cater to different project requirements. We demonstrate the effectiveness of the workflow and tool in two demanding applications: the transcription of a large corpus of historical printings, and the detection and annotation of handwritten artifacts within the private library of the Grimm brothers. In addition, we evaluate the dictionary-based correction and assess the efficiency improvements achieved with AdaptOCR in a pilot study.
Anthology ID:
2026.lrec-main.559
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
7023–7033
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.559/
Cite (ACL):
Norbert Fischer and Frank Puppe. 2026. Human-in-the-Loop Mass Transcription and Ground Truth Annotation for Challenging Historical Documents. International Conference on Language Resources and Evaluation, main:7023–7033.
Cite (Informal):
Human-in-the-Loop Mass Transcription and Ground Truth Annotation for Challenging Historical Documents (Fischer & Puppe, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.559.pdf