Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages

Humaira Mehmood; Sadaf Abdul-Rauf

Abstract

This paper presents our contribution to the IWSLT Low Resource Track 2: ‘Training and Evaluation Data Track’. We share a human-evaluated Urdu-English speech-to-text corpus based on the Common Voice 13.0 Urdu speech corpus. We followed a three-tier validation scheme: an initial automatic translation corrected by native reviewers, a full review by evaluators, and a final validation by a bilingual expert, ensuring a reliable corpus for subsequent NLP tasks. Our contribution, the CV-UrEnST corpus, enriches Urdu speech resources by providing the first Urdu-English speech-to-text corpus. When evaluated with Whisper-medium, the corpus yielded a significant improvement over the vanilla model in terms of BLEU, chrF++, and COMET scores, demonstrating its effectiveness for speech translation tasks.

Anthology ID: 2025.iwslt-1.12
Volume: Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Month: July
Year: 2025
Address: Vienna, Austria (in-person and online)
Editors: Elizabeth Salesky, Marcello Federico, Antonis Anastasopoulos
Venues: IWSLT | WS
Publisher: Association for Computational Linguistics
Pages: 138–144
URL: https://preview.aclanthology.org/landing_page/2025.iwslt-1.12/
Cite (ACL): Humaira Mehmood and Sadaf Abdul Rauf. 2025. Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 138–144, Vienna, Austria (in-person and online). Association for Computational Linguistics.
Cite (Informal): Human-Evaluated Urdu-English Speech Corpus: Advancing Speech-to-Text for Low-Resource Languages (Mehmood & Abdul Rauf, IWSLT 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.iwslt-1.12.pdf