Samuel Londner
2025
Systematic Textual Availability of Manuscripts
Hadar Miller | Samuel Londner | Tsvi Kuflik | Daria Vasyutinsky Shapira | Nachum Dershowitz | Moshe Lavee
Proceedings of the 5th Conference on Language, Data and Knowledge
The digital era has made millions of Hebrew manuscript images available to all. However, despite major advances in handwritten text recognition over the past decade, an efficient pipeline for large-scale, accurate conversion of these manuscripts into useful machine-readable form is still sorely lacking. We propose a pipeline that significantly improves recognition models for automatic transcription of Hebrew manuscripts. Transfer learning is used to fine-tune pretrained models. For post-recognition correction, the pipeline leverages text reuse, a common phenomenon in medieval manuscripts, together with state-of-the-art large language models for medieval Hebrew. The framework successfully handles noisy transcriptions and consistently suggests alternate, better readings. Initial results show that word-level accuracy increased by 10% for new readings proposed by text-reuse detection. Moreover, character-level accuracy improved by 18% when models were fine-tuned on the first few pages of each manuscript.
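The abstract reports results as word-level and character-level accuracy but does not define either metric. As a minimal sketch, assuming the common convention that accuracy is one minus the edit distance divided by the reference length, the two figures could be computed as follows. The function names `edit_distance`, `char_accuracy`, and `word_accuracy` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _accuracy(ref: Sequence, hyp: Sequence) -> float:
    # Assumed definition: 1 - (edit distance / reference length), floored at 0.
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character-level accuracy of a transcription against a reference."""
    return _accuracy(reference, hypothesis)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word-level accuracy, splitting both strings on whitespace."""
    return _accuracy(reference.split(), hypothesis.split())


if __name__ == "__main__":
    reference = "בראשית ברא אלהים"
    hypothesis = "בראשית כרא אלהים"  # one misread letter in the second word
    print(char_accuracy(reference, hypothesis))
    print(word_accuracy(reference, hypothesis))
```

Under this assumed definition, a single misread letter costs one character out of the whole line but one whole word, which is why character-level and word-level figures for the same transcription can differ widely.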