Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing "Noise" in Large Textual Data

Christian Schuler, Raman Ahmad, Ānrán Wáng, Daniil Gurgurov, Timo Baumann, Simon Ostermann, Josef van Genabith


Abstract
This work introduces a dialect-aware text filtering framework to pre-process, clean, and enhance large text corpora, creating variety-specific sub-corpora for neglected language varieties. We apply our framework to Kurdish, a language with rich dialectal diversity that presents significant challenges for Natural Language Processing due to its low-resource status and the noisy nature of available text corpora. Leveraging lexicographic features, we assign multi-language labels to text instances and synthesize over 130 dialect-specific corpora from large "noisy" data sets containing unlabeled mixtures of Kurdish varieties, representing, to our knowledge, the largest collection of dialect-specific Kurdish NLP resources to date. This work contributes to the creation of low-resource language technology foundations, especially dialect-specific NLP applications. Specifically, we advance research on Kurdish languages by providing insights into the linguistic relationships among Kurdish varieties.
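The labeling step described in the abstract could look roughly like the following minimal sketch. It is not the authors' implementation: the label names (`label_a`, `label_b`), the marker character sets, and the `min_hits` threshold are hypothetical placeholders chosen only to illustrate how surface features can assign several labels to one text instance and how label combinations can define sub-corpora.

```python
from collections import defaultdict

# Hypothetical marker inventories, for illustration only; these are NOT the
# features used in the paper. Escapes are used so the sets do not depend on
# how the source file is normalized.
MARKERS = {
    "label_a": set("\u00ea\u00ee\u00fb\u015f\u00e7"),  # ê î û ş ç
    "label_b": set("\u0695\u06b5\u06c6\u06ce"),        # ڕ ڵ ۆ ێ
}


def label_instance(text, markers=MARKERS, min_hits=1):
    """Return every label whose marker characters occur at least min_hits times."""
    labels = set()
    for label, chars in markers.items():
        hits = sum(1 for ch in text if ch in chars)
        if hits >= min_hits:
            labels.add(label)
    return labels


def split_corpus(lines, markers=MARKERS):
    """Group lines into sub-corpora keyed by their sorted label combination."""
    buckets = defaultdict(list)
    for line in lines:
        key = "+".join(sorted(label_instance(line, markers))) or "unlabeled"
        buckets[key].append(line)
    return dict(buckets)
```

Because an instance may receive more than one label, mixed lines land in their own combined bucket (here `label_a+label_b`) instead of being forced into a single variety, which is one way a noisy mixture could yield many distinct sub-corpora.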
Anthology ID:
2026.lrec-main.116
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
1505–1519
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.116/
Cite (ACL):
Christian Schuler, Raman Ahmad, Ānrán Wáng, Daniil Gurgurov, Timo Baumann, Simon Ostermann, and Josef van Genabith. 2026. Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing "Noise" in Large Textual Data. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 1505–1519, Palma de Mallorca, Spain.
Cite (Informal):
Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing “Noise” in Large Textual Data (Schuler et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.116.pdf