Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice

Mohammad Mohammadamini, Daban Jaff, Sara Jamal, Ibrahim Ahmed, Hawkar Omar, Darya Sabr, Marie Tahon, Antoine Laurent


Abstract
In this paper, we introduce Kuvost, a large-scale English-to-Central Kurdish speech-to-text translation (S2TT) dataset. The dataset includes 786k utterances derived from Common Voice 18, translated and revised into Central Kurdish by 230 volunteers. Encompassing 1,003 hours of translated speech, it can play a groundbreaking role for Central Kurdish, which severely lacks public-domain resources for speech translation. Following the dataset division in Common Voice, there are 298k, 6,226, and 7,253 samples in the train, development, and test sets, respectively. The dataset is evaluated on end-to-end English-to-Kurdish S2TT using the Whisper V3 Large and SeamlessM4T V2 Large models. It is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License at https://huggingface.co/datasets/aranemini/kuvost.
Anthology ID:
2025.iwslt-1.9
Volume:
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria (in-person and online)
Editors:
Elizabeth Salesky, Marcello Federico, Antonis Anastasopoulos
Venues:
IWSLT | WS
Publisher:
Association for Computational Linguistics
Pages:
106–109
URL:
https://preview.aclanthology.org/landing_page/2025.iwslt-1.9/
Cite (ACL):
Mohammad Mohammadamini, Daban Jaff, Sara Jamal, Ibrahim Ahmed, Hawkar Omar, Darya Sabr, Marie Tahon, and Antoine Laurent. 2025. Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 106–109, Vienna, Austria (in-person and online). Association for Computational Linguistics.
Cite (Informal):
Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice (Mohammadamini et al., IWSLT 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.iwslt-1.9.pdf