2021
NLP in the DH pipeline: Transfer-learning to a Chronolect
Aynat Rubinstein | Avi Shmidman
Proceedings of the Workshop on Natural Language Processing for Digital Humanities
A big unknown in Digital Humanities (DH) projects that seek to analyze previously untouched corpora is the question of how to adapt existing Natural Language Processing (NLP) resources to the specific nature of the target corpus. In this paper, we study the case of Emergent Modern Hebrew (EMH), an under-resourced chronolect of the Hebrew language. The resource we seek to adapt, a diacritizer, exists for both earlier and later chronolects of the language. Given a small annotated corpus of our target chronolect, we demonstrate that applying transfer-learning from either of the chronolects is preferable to training a new model from scratch. Furthermore, we consider just how much annotated data is necessary. For our task, we find that even a minimal corpus of 50K tokens provides a noticeable gain in accuracy. At the same time, we also evaluate accuracy at three additional increments, in order to quantify the gains that can be expected by investing in a larger annotated corpus.
2020
A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration
Avi Shmidman | Joshua Guedalia | Shaltiel Shmidman | Moshe Koppel | Reut Tsarfaty
Findings of the Association for Computational Linguistics: EMNLP 2020
One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not be enough examples of the minority analyses either to properly evaluate performance or to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs, the first of its kind, containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current SOTA of Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state of the art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.
Nakdan: Professional Hebrew Diacritizer
Avi Shmidman | Shaltiel Shmidman | Moshe Koppel | Yoav Goldberg
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
We present a system for automatic diacritization of Hebrew text. The system combines modern neural models with carefully curated declarative linguistic knowledge and comprehensive manually constructed tables and dictionaries. Besides providing state-of-the-art diacritization accuracy, the system also supports an interface for manual editing and correction of the automatic output, and has several features that make it particularly useful for preparing scientific editions of historical Hebrew texts. The system supports Modern Hebrew, Rabbinic Hebrew, and Poetic Hebrew. The system is freely accessible for all use at http://nakdanpro.dicta.org.il
2016
Shamela: A Large-Scale Historical Arabic Corpus
Yonatan Belinkov | Alexander Magidow | Maxim Romanov | Avi Shmidman | Moshe Koppel
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies in which we show its application to the digital humanities.