Systematic Textual Availability of Manuscripts
Hadar Miller, Samuel Londner, Tsvi Kuflik, Daria Vasyutinsky Shapira, Nachum Dershowitz, Moshe Lavee
Abstract
The digital era has made millions of manuscript images in Hebrew available to all. However, despite major advancements in handwritten text recognition over the past decade, an efficient pipeline for large scale and accurate conversion of these manuscripts into useful machine-readable form is still sorely lacking. We propose a pipeline that significantly improves recognition models for automatic transcription of Hebrew manuscripts. Transfer learning is used to fine-tune pretrained models. For post-recognition correction, it leverages text reuse, a common phenomenon in medieval manuscripts, and state-of-the-art large language models for medieval Hebrew. The framework successfully handles noisy transcriptions and consistently suggests alternate, better readings. Initial results show that word level accuracy increased by 10% for new readings proposed by text-reuse detection. Moreover, the character level accuracy improved by 18% by fine-tuning models on the first few pages of each manuscript.
- Anthology ID:
- 2025.ldk-1.18
- Volume:
- Proceedings of the 5th Conference on Language, Data and Knowledge
- Month:
- September
- Year:
- 2025
- Address:
- Naples, Italy
- Editors:
- Mehwish Alam, Andon Tchechmedjiev, Jorge Gracia, Dagmar Gromann, Maria Pia di Buono, Johanna Monti, Maxim Ionov
- Venues:
- LDK | WS
- Publisher:
- Unior Press
- Pages:
- 162–173
- URL:
- https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.18/
- Cite (ACL):
- Hadar Miller, Samuel Londner, Tsvi Kuflik, Daria Vasyutinsky Shapira, Nachum Dershowitz, and Moshe Lavee. 2025. Systematic Textual Availability of Manuscripts. In Proceedings of the 5th Conference on Language, Data and Knowledge, pages 162–173, Naples, Italy. Unior Press.
- Cite (Informal):
- Systematic Textual Availability of Manuscripts (Miller et al., LDK 2025)
- PDF:
- https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.18.pdf