Humaira Mehmood


2025

pdf bib
Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages
Humaira Mehmood | Sadaf Abdul Rauf
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper presents our contribution to the IWSLT Low Resource Track 2: ‘Training and Evaluation Data Track’. We share a human-evaluated Urdu-English speech-to-text corpus based on Common Voice 13.0 Urdu speech corpus. We followed a three-tier validation scheme which involves an initial automatic translation with corrections from native reviewers, full review by evaluators followed by final validation from a bilingual expert ensuring reliable corpus for subsequent NLP tasks. Our contribution, CV-UrEnST corpus, enriches Urdu speech resources by contributing the first Urdu-English speech-to-text corpus. When evaluated with Whisper-medium, the corpus yielded a significant improvement to the vanilla model in terms of BLEU, chrF++, and COMET scores, demonstrating its effectiveness for speech translation tasks.