A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts
Iglika Nikolova-Stoupak, Maxime Amblard, Sophie Robert-Hayek, Frédérique Rey
Abstract
The current project is situated within the field of stemmatology, the study and/or reconstruction of textual transmission based on the relationships between the available witnesses of given texts. In particular, it considers word-level variants (differences) in manuscripts written in Biblical Hebrew. A strong classifier (F1 value of 0.80) is trained to predict the category of difference between word pairs (‘plus/minus’, ‘inversion’, ‘morphological’, ‘lexical’ or ‘unclassifiable’) as present in collated (aligned) pairs of witnesses. The classifier is non-neural and makes use of the two words themselves as well as part-of-speech (POS) tags, hand-crafted rules per category and synthetically derived data. Other models experimented with include neural ones based on DictaBERT, the state-of-the-art model for Modern Hebrew. Other features tested for relevance are different types of morphological information pertaining to the word pairs and the Levenshtein distance between the words. A selection of the strongest classifiers is made available, along with the synthetic data used and the steps taken to derive it. In addition, the correlation between two sets of morphological labels is investigated: those professionally established as per the Qumran-Digital online library and those automatically derived with the sub-model DictaBERT-morph.
- Anthology ID:
- 2025.findings-acl.1098
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 21313–21329
- URL:
- https://preview.aclanthology.org/landing_page/2025.findings-acl.1098/
- Cite (ACL):
- Iglika Nikolova-Stoupak, Maxime Amblard, Sophie Robert-Hayek, and Frédérique Rey. 2025. A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21313–21329, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts (Nikolova-Stoupak et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.findings-acl.1098.pdf