Abstract
In the growing domain of natural language processing, low-resourced languages like Northern Kurdish remain largely unexplored due to the lack of resources needed to be part of this growth. In particular, the tasks of part-of-speech tagging and tokenization for Northern Kurdish are still insufficiently addressed. In this study, we aim to bridge this gap by evaluating a range of statistical, neural, and fine-tuned models specifically tailored for Northern Kurdish. We leverage limited but valuable datasets, including the Universal Dependencies Kurmanji treebank and a novel manually annotated and tokenized gold-standard dataset consisting of 136 sentences (2,937 tokens). We evaluate several POS tagging models and report that the fine-tuned transformer-based model outperforms the others, achieving an accuracy of 0.87 and a macro-averaged F1 score of 0.77. Data and models are publicly available under an open license at https://github.com/peshmerge/northern-kurdish-pos-tagging
- Anthology ID:
- 2024.mwe-1.11
- Volume:
- Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Archna Bhatia, Gosse Bouma, A. Seza Doğruöz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademaker
- Venues:
- MWE | UDW | WS
- SIGs:
- SIGPARSE | SIGLEX
- Publisher:
- ELRA and ICCL
- Pages:
- 70–80
- URL:
- https://preview.aclanthology.org/remove-affiliations/2024.mwe-1.11/
- Cite (ACL):
- Peshmerge Morad, Sina Ahmadi, and Lorenzo Gatti. 2024. Part-of-Speech Tagging for Northern Kurdish. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 70–80, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Part-of-Speech Tagging for Northern Kurdish (Morad et al., MWE-UDW 2024)
- PDF:
- https://preview.aclanthology.org/remove-affiliations/2024.mwe-1.11.pdf
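
The abstract reports two token-level metrics, accuracy and macro-averaged F1. The following is a minimal sketch of how these two quantities are commonly computed from aligned gold and predicted tag sequences; it is not the authors' evaluation code. The function names `accuracy` and `macro_f1` and the toy tag sequences are illustrative assumptions, and the sketch assumes the label set is the union of tags seen in either sequence.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over all tags seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong for this token
            fn[g] += 1  # gold tag g was missed for this token
    labels = set(gold) | set(pred)
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is non-zero because
    # every label occurs at least once in gold or pred.
    scores = [2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]) for t in labels]
    return sum(scores) / len(scores)


# Toy example (hypothetical data, not taken from the paper).
gold_tags = ["NOUN", "VERB", "NOUN", "ADP"]
pred_tags = ["NOUN", "NOUN", "NOUN", "ADP"]
print(accuracy(gold_tags, pred_tags))
print(macro_f1(gold_tags, pred_tags))
```

Because macro-averaging weights every tag equally, a rare tag that is tagged poorly lowers the macro F1 much more than it lowers accuracy, which is one plausible reason the two reported figures (0.87 and 0.77) differ.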