NLP in the DH pipeline: Transfer-learning to a Chronolect

Aynat Rubinstein, Avi Shmidman


Abstract
A major open question in Digital Humanities (DH) projects that seek to analyze previously untouched corpora is how to adapt existing Natural Language Processing (NLP) resources to the specific nature of the target corpus. In this paper, we study the case of Emergent Modern Hebrew (EMH), an under-resourced chronolect of the Hebrew language. The resource we seek to adapt, a diacritizer, exists for both earlier and later chronolects of the language. Given a small annotated corpus of our target chronolect, we demonstrate that applying transfer learning from either of the chronolects is preferable to training a new model from scratch. Furthermore, we consider just how much annotated data is necessary. For our task, we find that even a minimal corpus of 50K tokens provides a noticeable gain in accuracy. We also evaluate accuracy at three larger increments, quantifying the gains that can be expected from investing in a larger annotated corpus.
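The core idea of the paper — that initializing from a model trained on a related chronolect beats training from scratch when target-domain data is scarce — can be illustrated with a deliberately tiny sketch. This is not the authors' code (their system is a neural diacritizer); it is a hypothetical toy example in which a one-parameter linear model is fit to a small "target" dataset by gradient descent, comparing a cold start against a warm start taken from a nearby "source" model.

```python
# Hypothetical sketch of warm-starting ("transfer learning") vs. training
# from scratch. All names and data here are illustrative assumptions,
# standing in for the paper's EMH diacritization setup.

def train(w_init, data, lr=0.05, steps=20):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    w = w_init
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def mse(w, data):
    """Mean squared error of y = w * x on the given (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Tiny "target chronolect" sample: the true relation is y = 3.0 * x.
target_data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

# Cold start: train from scratch, far from the target solution.
w_cold = train(0.0, target_data)
# Warm start: initialize from a related "source chronolect" model (w = 2.7).
w_warm = train(2.7, target_data)

print("cold-start MSE:", mse(w_cold, target_data))
print("warm-start MSE:", mse(w_warm, target_data))
```

With the same data and training budget, the warm-started model ends closer to the target solution — the one-dimensional analogue of the paper's finding that transfer from either neighboring chronolect outperforms training a new model from scratch on the small annotated EMH corpus.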
Anthology ID: 2021.nlp4dh-1.12
Volume: Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Month: December
Year: 2021
Address: NIT Silchar, India
Venue: NLP4DH
Publisher: NLP Association of India (NLPAI)
Pages: 106–110
URL: https://aclanthology.org/2021.nlp4dh-1.12
Cite (ACL): Aynat Rubinstein and Avi Shmidman. 2021. NLP in the DH pipeline: Transfer-learning to a Chronolect. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities, pages 106–110, NIT Silchar, India. NLP Association of India (NLPAI).
Cite (Informal): NLP in the DH pipeline: Transfer-learning to a Chronolect (Rubinstein & Shmidman, NLP4DH 2021)
PDF: https://preview.aclanthology.org/auto-file-uploads/2021.nlp4dh-1.12.pdf