Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing "Noise" in Large Textual Data
Christian Schuler, Raman Ahmad, Ānrán Wáng, Daniil Gurgurov, Timo Baumann, Simon Ostermann, Josef van Genabith
Abstract
This work introduces a dialect-aware text filtering framework to pre-process, clean, and enhance large text corpora, creating variety-specific sub-corpora for neglected language varieties. We apply our framework to Kurdish, a language with rich dialectal diversity, which presents significant challenges for Natural Language Processing due to its low-resource status and the noisy nature of available text corpora. Leveraging lexicographic features, we assign multi-language labels to text instances and synthesize over 130 dialect-specific corpora from large "noisy" data sets containing unlabeled mixtures of Kurdish varieties, representing, to our knowledge, the largest collection of dialect-specific Kurdish NLP resources to date. This work contributes to the creation of low-resource language technology foundations, especially dialect-specific NLP applications. Specifically, we advance research on Kurdish languages by providing insights into the linguistic relationships among Kurdish varieties.
- Anthology ID:
- 2026.lrec-main.116
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 1505–1519
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.116/
- Cite (ACL):
- Christian Schuler, Raman Ahmad, Ānrán Wáng, Daniil Gurgurov, Timo Baumann, Simon Ostermann, and Josef van Genabith. 2026. Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing "Noise" in Large Textual Data. International Conference on Language Resources and Evaluation, main:1505–1519.
- Cite (Informal):
- Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing “Noise” in Large Textual Data (Schuler et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.116.pdf