Hawkar Omar

2025

pdf bib abs
Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice
Mohammad Mohammadamini | Daban Jaff | Sara Jamal | Ibrahim Ahmed | Hawkar Omar | Darya Sabr | Marie Tahon | Antoine Laurent
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

In this paper, we introduce the Kuvost, a large-scale English to Central Kurdish speech-to-text-translation (S2TT) dataset. This dataset includes 786k utterances derived from Common Voice 18, translated and revised by 230 volunteers into Central Kurdish. Encompassing 1,003 hours of translated speech, this dataset can play a groundbreaking role for Central Kurdish, which severely lacks public-domain resources for speech translation. Following the dataset division in Common Voice, there are 298k, 6,226, and 7,253 samples in the train, development, and test sets, respectively. The dataset is evaluated on end-to-end English-to-Kurdish S2TT using Whisper V3 Large and SeamlessM4T V2 Large models. The dataset is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License https://huggingface.co/datasets/aranemini/kuvost.

Co-authors

Darya Sabr 1

Marie Tahon 1

Venues

iwslt1
ws1

Fix author