Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice
Mohammad Mohammadamini, Daban Jaff, Sara Jamal, Ibrahim Ahmed, Hawkar Omar, Darya Sabr, Marie Tahon, Antoine Laurent
Abstract
In this paper, we introduce Kuvost, a large-scale English-to-Central Kurdish speech-to-text translation (S2TT) dataset. The dataset includes 786k utterances derived from Common Voice 18, translated and revised into Central Kurdish by 230 volunteers. Encompassing 1,003 hours of translated speech, it can play a groundbreaking role for Central Kurdish, which severely lacks public-domain resources for speech translation. Following the dataset division in Common Voice, there are 298k, 6,226, and 7,253 samples in the train, development, and test sets, respectively. The dataset is evaluated on end-to-end English-to-Kurdish S2TT using the Whisper V3 Large and SeamlessM4T V2 Large models. The dataset is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License at https://huggingface.co/datasets/aranemini/kuvost.
- Anthology ID:
- 2025.iwslt-1.9
- Volume:
- Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria (in-person and online)
- Editors:
- Elizabeth Salesky, Marcello Federico, Antonis Anastasopoulos
- Venues:
- IWSLT | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 106–109
- URL:
- https://preview.aclanthology.org/landing_page/2025.iwslt-1.9/
- Cite (ACL):
- Mohammad Mohammadamini, Daban Jaff, Sara Jamal, Ibrahim Ahmed, Hawkar Omar, Darya Sabr, Marie Tahon, and Antoine Laurent. 2025. Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 106–109, Vienna, Austria (in-person and online). Association for Computational Linguistics.
- Cite (Informal):
- Kuvost: A Large-Scale Human-Annotated English to Central Kurdish Speech Translation Dataset Driven from English Common Voice (Mohammadamini et al., IWSLT 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.iwslt-1.9.pdf