Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach
Soumedhik Bharati, Shibam Mandal, Subham Majumdar, Swarup Kr Ghosh, Sayani Mondal
Abstract
Sorani Kurdish is reportedly spoken by approximately 6 million people in Iraq and Iran. It exhibits substantial regional variation but lacks computational resources for dialect identification. We present the first fine-grained sub-dialect classification system for six Sorani varieties: Sulaymaniyah, Erbil, Iranian Sorani, Ardalani, Babani, and Mukriani. Our approach combines cross-lingual contextual embeddings (XLM-RoBERTa) with morphological features derived from explicit linguistic rules, including 24 patterns capturing verb prefixes, pronominal clitics, and definite markers. The proposed morphology-augmented XLM-R model is trained on a unified dataset of 16,409 sentences without manual annotation and achieves 91.91% accuracy, outperforming pure transformer models (91.79%) and traditional machine learning baselines (SVM: 86.41%). Ablation studies indicate that morphological features act as effective regularizers for geographically proximate dialects.
- Anthology ID:
- 2026.abjadnlp-1.24
- Volume:
- Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Venues:
- AbjadNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 172–176
- URL:
- https://preview.aclanthology.org/manual-author-scripts/2026.abjadnlp-1.24/
- Cite (ACL):
- Soumedhik Bharati, Shibam Mandal, Subham Majumdar, Swarup Kr Ghosh, and Sayani Mondal. 2026. Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 172–176, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach (Bharati et al., AbjadNLP 2026)
- PDF:
- https://preview.aclanthology.org/manual-author-scripts/2026.abjadnlp-1.24.pdf
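
The hybrid design described in the abstract, where rule-based morphological features are combined with a contextual sentence embedding, might look roughly like the sketch below. This is not the authors' implementation. The four regular expressions are illustrative placeholders rather than the paper's 24 patterns, the names `PATTERNS`, `morph_features`, and `fuse` are hypothetical, and the embedding is a plain list of floats standing in for an XLM-RoBERTa sentence vector. Concatenation is assumed as the fusion step, since the abstract does not say how the two representations are combined.

```python
import re
from typing import Dict, List, Sequence

# Illustrative rules only (assumption): NOT the 24 patterns used in the paper.
# Code points are written as escapes so the script stays unambiguous.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    # word starting with U+062F U+06D5 (a "de-" style verb prefix)
    "verb_prefix_de": re.compile(r"\b\u062f\u06d5\w+"),
    # word starting with U+0626 U+06D5 (an "e-" style verb prefix)
    "verb_prefix_e": re.compile(r"\b\u0626\u06d5\w+"),
    # word ending in U+06D5 U+06A9 U+06D5 (an "-eke" style definite marker)
    "definite_eke": re.compile(r"\w+\u06d5\u06a9\u06d5\b"),
    # word ending in U+0645 U+0627 U+0646 (a "-man" style pronominal clitic)
    "clitic_man": re.compile(r"\w+\u0645\u0627\u0646\b"),
}


def morph_features(sentence: str) -> List[float]:
    """Return one length-normalised match count per pattern."""
    tokens = sentence.split()
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [len(p.findall(sentence)) / n for p in PATTERNS.values()]


def fuse(embedding: Sequence[float], sentence: str) -> List[float]:
    """Concatenate a sentence embedding with the morphological features.

    `embedding` is a placeholder for a contextual sentence vector; the
    fused vector would then be passed to a classifier head.
    """
    return list(embedding) + morph_features(sentence)


if __name__ == "__main__":
    # Three-token example: one "de-" prefixed word, one "-eke" suffixed word.
    example = ("\u062f\u06d5\u0686\u0645 \u0628\u06c6 "
               "\u0645\u0627\u06b5\u06d5\u06a9\u06d5")
    print(morph_features(example))
    print(len(fuse([0.0] * 8, example)))
```

Normalising counts by sentence length keeps the features comparable across short and long sentences. Whether the paper normalises its features this way is not stated in the abstract.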