Joshua Guedalia

2025

Ancient texts often lack punctuation marks, making it challenging to determine sentence boundaries and clause boundaries. Texts may contain sequences of hundreds of words without any period or indication of a full stop. Determining such boundaries is a crucial step in various NLP pipelines, especially regarding language models such as BERT that have context window constraints and regarding machine translation models which may become far less accurate when fed too much text at a time. In this paper, we consider several novel approaches to automatic segmentation of unpunctuated ancient texts into grammatically complete or semi-complete units. Our work here focuses on ancient and historical Hebrew and Aramaic texts, but the tools developed can be applied equally to similar languages. We explore several approaches to addressing this task: masked language models (MLM) to predict the next token; fewshot completions via an open-source foundational LLM; and the “Segment-Any-Text” (SaT) tool by Frohmann et al. (Frohmann et al., 2024). These are then compared to instructbased flows using commercial (closed, managed) LLMs, to be used as a benchmark. To evaluate these approaches, we also introduce a new ground truth (GT) dataset of manually segmented texts. We explore the performance of our different approaches on this dataset. We release both our segmentation tools and the dataset to support further research into computational processing and analysis of ancient texts, which can be found here ‘https://github.com/ERC-Midrash/rabbinic_chunker’.

2020

pdf bib abs
A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration
Avi Shmidman | Joshua Guedalia | Shaltiel Shmidman | Moshe Koppel | Reut Tsarfaty
Findings of the Association for Computational Linguistics: EMNLP 2020

One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not exist sufficient examples of the minority analyses in order to properly evaluate performance, nor to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs — the first of its kind — containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current SOTA of Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state-of-the-art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.

Co-authors

Elisha Rosensweig 1

Shaltiel Shmidman 1

Reut Tsarfaty 1

Venues

Fix data