Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages

Humaira Mehmood, Sadaf Abdul Rauf


Abstract
This paper presents our contribution to the IWSLT Low Resource Track 2: ‘Training and Evaluation Data Track’. We share a human-evaluated Urdu-English speech-to-text corpus based on the Common Voice 13.0 Urdu speech corpus. We followed a three-tier validation scheme: an initial automatic translation corrected by native reviewers, a full review by evaluators, and a final validation by a bilingual expert, ensuring a reliable corpus for subsequent NLP tasks. Our contribution, the CV-UrEnST corpus, enriches Urdu speech resources by providing the first Urdu-English speech-to-text corpus. When evaluated with Whisper-medium, the corpus yielded a significant improvement over the vanilla model in terms of BLEU, chrF++, and COMET scores, demonstrating its effectiveness for speech translation tasks.
Anthology ID:
2025.iwslt-1.12
Volume:
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria (in-person and online)
Editors:
Elizabeth Salesky, Marcello Federico, Antonis Anastasopoulos
Venues:
IWSLT | WS
Publisher:
Association for Computational Linguistics
Pages:
138–144
URL:
https://preview.aclanthology.org/landing_page/2025.iwslt-1.12/
Cite (ACL):
Humaira Mehmood and Sadaf Abdul Rauf. 2025. Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 138–144, Vienna, Austria (in-person and online). Association for Computational Linguistics.
Cite (Informal):
Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages (Mehmood & Abdul Rauf, IWSLT 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.iwslt-1.12.pdf