Abstract
A big unknown in Digital Humanities (DH) projects that seek to analyze previously untouched corpora is the question of how to adapt existing Natural Language Processing (NLP) resources to the specific nature of the target corpus. In this paper, we study the case of Emergent Modern Hebrew (EMH), an under-resourced chronolect of the Hebrew language. The resource we seek to adapt, a diacritizer, exists for both earlier and later chronolects of the language. Given a small annotated corpus of our target chronolect, we demonstrate that applying transfer-learning from either of the chronolects is preferable to training a new model from scratch. Furthermore, we consider just how much annotated data is necessary. For our task, we find that even a minimal corpus of 50K tokens provides a noticeable gain in accuracy. At the same time, we also evaluate accuracy at three additional increments, in order to quantify the gains that can be expected by investing in a larger annotated corpus.

- Anthology ID: 2021.nlp4dh-1.12
- Volume: Proceedings of the Workshop on Natural Language Processing for Digital Humanities
- Month: December
- Year: 2021
- Address: NIT Silchar, India
- Editors: Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
- Venue: NLP4DH
- Publisher: NLP Association of India (NLPAI)
- Pages: 106–110
- URL: https://aclanthology.org/2021.nlp4dh-1.12
- Cite (ACL): Aynat Rubinstein and Avi Shmidman. 2021. NLP in the DH pipeline: Transfer-learning to a Chronolect. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities, pages 106–110, NIT Silchar, India. NLP Association of India (NLPAI).
- Cite (Informal): NLP in the DH pipeline: Transfer-learning to a Chronolect (Rubinstein & Shmidman, NLP4DH 2021)
- PDF: https://preview.aclanthology.org/fix-dup-bibkey/2021.nlp4dh-1.12.pdf
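The abstract's central claim — that initializing from a related chronolect beats training from scratch on a small target corpus — can be illustrated with a deliberately tiny toy model. The sketch below is *not* the paper's diacritizer: it is a hypothetical unigram lookup that predicts the most frequent diacritized form for each bare token, where "transfer" simply means seeding the model with counts from a source-chronolect corpus before adding the small target-chronolect annotation. All token pairs are invented placeholders.

```python
from collections import Counter, defaultdict

# Toy stand-in for a diacritizer (NOT the paper's model): map each bare
# token to its most frequent diacritized form. "Transfer learning" here is
# modeled as starting from source-chronolect counts and adding target counts.

def train(pairs, base=None):
    """Build token -> Counter-of-diacritized-forms, optionally on top of a base model."""
    counts = defaultdict(Counter)
    if base is not None:
        for tok, c in base.items():
            counts[tok].update(c)  # inherit source-chronolect statistics
    for bare, dia in pairs:
        counts[bare][dia] += 1
    return counts

def predict(model, bare):
    if model[bare]:
        return model[bare].most_common(1)[0][0]
    return bare  # back off: leave an unseen token undiacritized

def accuracy(model, pairs):
    return sum(predict(model, b) == d for b, d in pairs) / len(pairs)

# Hypothetical (bare form, diacritized form) pairs; placeholders only.
source_corpus = [("dbr", "davar")] * 10 + [("bn", "ben")]  # earlier chronolect
target_corpus = [("bn", "bin")] * 2                        # small EMH-style annotation
test_set = [("bn", "bin"), ("dbr", "davar")]

scratch = train(target_corpus)                       # target data only
transferred = train(target_corpus, base=train(source_corpus))

print(accuracy(scratch, test_set))      # misses "dbr", never seen in target data
print(accuracy(transferred, test_set))  # source counts cover "dbr"; target counts override "bn"
```

Even in this caricature, the transferred model wins for the reason the abstract gives: the source chronolect supplies coverage for tokens absent from the small target corpus, while the target annotation corrects forms where the chronolects diverge.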