Jenalea Rajab

2025

This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, detail the methodology for ViXSD, and report ASR experiments validating ViXSD’s usability in building and refining voice-driven applications for isiXhosa.

2023

Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora
Richard Lastrucci | Jenalea Rajab | Matimba Shingange | Daniel Njini | Vukosi Marivate
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

This paper introduces two multilingual government-themed corpora in various South African languages. The corpora were collected by gathering South African government speeches (ZA-gov-multilingual), as well as the South African Government newspaper (Vuk’uzenzele), both of which are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corpora were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we highlight the process of gathering, cleaning, and making available the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.