Abstract
In this study we apply classification methods for detecting subdialectal differences in Sorani Kurdish texts produced in different regions, namely Iran and Iraq. As Sorani is a low-resource language, no corpus including texts from different regions was readily available. To this end, we identified data sources that could be leveraged for this task to create a dataset of 200,000 sentences. Using surface features, we attempted to classify Sorani subdialects, showing that sentences from news sources in Iraq and Iran are distinguishable with 96% accuracy. This is the first preliminary study for a dialect that has not been widely studied in computational linguistics, evidencing the possible existence of distinct subdialects.

- Anthology ID: W16-4812
- Volume: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
- Month: December
- Year: 2016
- Address: Osaka, Japan
- Editors: Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
- Venue: VarDial
- Publisher: The COLING 2016 Organizing Committee
- Pages: 89–96
- URL: https://aclanthology.org/W16-4812
- Cite (ACL): Shervin Malmasi. 2016. Subdialectal Differences in Sorani Kurdish. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 89–96, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal): Subdialectal Differences in Sorani Kurdish (Malmasi, VarDial 2016)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/W16-4812.pdf