Automatic Text Segmentation of Ancient and Historic Hebrew
Elisha Rosensweig, Benjamin Resnick, Hillel Gershuni, Joshua Guedalia, Nachum Dershowitz, Avi Shmidman
Abstract
Ancient texts often lack punctuation marks, making it challenging to determine sentence and clause boundaries. Texts may contain sequences of hundreds of words without any period or indication of a full stop. Determining such boundaries is a crucial step in various NLP pipelines, especially for language models such as BERT, which have context window constraints, and for machine translation models, which may become far less accurate when fed too much text at a time. In this paper, we consider several novel approaches to automatic segmentation of unpunctuated ancient texts into grammatically complete or semi-complete units. Our work here focuses on ancient and historical Hebrew and Aramaic texts, but the tools developed can be applied equally to similar languages. We explore several approaches to this task: masked language models (MLM) to predict the next token; few-shot completions via an open-source foundational LLM; and the “Segment-Any-Text” (SaT) tool (Frohmann et al., 2024). These are then compared to instruct-based flows using commercial (closed, managed) LLMs, which serve as a benchmark. To evaluate these approaches, we also introduce a new ground truth (GT) dataset of manually segmented texts, and we report the performance of each approach on this dataset. We release both our segmentation tools and the dataset to support further research into computational processing and analysis of ancient texts; they are available at https://github.com/ERC-Midrash/rabbinic_chunker.
- Anthology ID:
- 2025.alp-1.1
- Volume:
- Proceedings of the Second Workshop on Ancient Language Processing
- Month:
- May
- Year:
- 2025
- Address:
- The Albuquerque Convention Center, Laguna
- Editors:
- Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti, Rachele Sprugnoli
- Venues:
- ALP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1–11
- URL:
- https://preview.aclanthology.org/moar-dois/2025.alp-1.1/
- DOI:
- 10.18653/v1/2025.alp-1.1
- Cite (ACL):
- Elisha Rosensweig, Benjamin Resnick, Hillel Gershuni, Joshua Guedalia, Nachum Dershowitz, and Avi Shmidman. 2025. Automatic Text Segmentation of Ancient and Historic Hebrew. In Proceedings of the Second Workshop on Ancient Language Processing, pages 1–11, The Albuquerque Convention Center, Laguna. Association for Computational Linguistics.
- Cite (Informal):
- Automatic Text Segmentation of Ancient and Historic Hebrew (Rosensweig et al., ALP 2025)
- PDF:
- https://preview.aclanthology.org/moar-dois/2025.alp-1.1.pdf