Hillel Gershuni
2026
Human-AI Annotation Error Auditing for Hebrew Diacritization with Frontier LLMs
Hillel Gershuni | Avi Shmidman
Proceedings of the 20th Linguistic Annotation Workshop (LAW XX)
Large annotated datasets inevitably contain errors that are costly to identify via manual review. We study a human-AI annotation error auditing workflow using frontier Large Language Models (LLMs), focusing on Hebrew nikud (diacritization). We take the EACL 2023 Hebrew Homograph Challenge Set as our test case. In a focused evaluation on 12 of the homograph sets with 271 confirmed errors (verified through exhaustive manual review of all 7,241 sentences), Gemini 3 Pro achieves 83.6% recall (95% confidence interval: [79.3%, 88.2%]) and 99.1% precision, substantially higher than other frontier LLMs. Two independent human experts achieved 62.4% and 42.8% recall respectively, a 20-percentage-point spread that reflects the difficulty of sparse-target error search. Even the union of both experts’ findings (73.4% recall) falls short of a single LLM run (83.6%), while LLM-aided auditing reduces review effort by over 95%. We analyze the trade-offs between batch size and recall, and release both a human-verified Gold Standard with per-error difficulty annotations and a globally corrected version of the Challenge Set.
2025
Automatic Text Segmentation of Ancient and Historic Hebrew
Elisha Rosensweig | Benjamin Resnick | Hillel Gershuni | Joshua Guedalia | Nachum Dershowitz | Avi Shmidman
Proceedings of the Second Workshop on Ancient Language Processing
Ancient texts often lack punctuation marks, making it challenging to determine sentence and clause boundaries. Texts may contain sequences of hundreds of words without any period or indication of a full stop. Determining such boundaries is a crucial step in various NLP pipelines, especially for language models such as BERT that have context window constraints and for machine translation models, which may become far less accurate when fed too much text at a time. In this paper, we consider several novel approaches to automatic segmentation of unpunctuated ancient texts into grammatically complete or semi-complete units. Our work here focuses on ancient and historical Hebrew and Aramaic texts, but the tools developed can be applied equally to similar languages. We explore several approaches to addressing this task: masked language models (MLM) to predict the next token; few-shot completions via an open-source foundational LLM; and the “Segment-Any-Text” (SaT) tool (Frohmann et al., 2024). These are then compared to instruct-based flows using commercial (closed, managed) LLMs, which serve as a benchmark. To evaluate these approaches, we also introduce a new ground truth (GT) dataset of manually segmented texts and report the performance of each approach on it. To support further research into computational processing and analysis of ancient texts, we release both our segmentation tools and the dataset at https://github.com/ERC-Midrash/rabbinic_chunker.
2024
MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts
Avi Shmidman | Ometz Shmidman | Hillel Gershuni | Moshe Koppel
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Hebrew manuscripts preserve thousands of textual transmissions of post-Biblical Hebrew texts from the first millennium. In many cases, the text in the manuscripts is not fully decipherable, whether due to deterioration, perforation, burns, or otherwise. Existing BERT models for Hebrew struggle to fill these gaps, due to the many orthographical deviations found in Hebrew manuscripts. We have pretrained a new dedicated BERT model, dubbed MsBERT (short for Manuscript BERT), designed from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms all existing Hebrew BERT models at predicting missing words in fragmentary Hebrew manuscript transcriptions across multiple genres, as well as at differentiating between quoted passages and exegetical elaborations. We provide MsBERT for free download and unrestricted use, and we also provide an interactive, user-friendly website that allows manuscript scholars to leverage the power of MsBERT in their scholarly work of reconstructing fragmentary Hebrew manuscripts.