Benjamin Rosman
2025
The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages
Jenalea Rajab | Anuoluwapo Aremu | Everlyn Asiko Chimoto | Dale Dunbar | Graham Morrissey | Fadel Thior | Luandrie Potgieter | Jessica Ojo | Atnafu Lambebo Tonja | Wilhelmina NdapewaOnyothi Nekoto | Pelonomi Moiloa | Jade Abbott | Vukosi Marivate | Benjamin Rosman
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, detail the methodology for ViXSD, and report ASR experiments validating ViXSD’s usability in building and refining voice-driven applications for isiXhosa.
2024
The Zeno’s Paradox of ‘Low-Resource’ Languages
Hellina Hailu Nigatu | Atnafu Lambebo Tonja | Benjamin Rosman | Thamar Solorio | Monojit Choudhury
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low- vs. high-resourced. However, there is limited consensus on what exactly qualifies as a ‘low-resource language.’ To understand how NLP papers define and study ‘low-resource’ languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword ‘low-resource.’ Based on our analysis, we show how several interacting axes contribute to the ‘low-resourcedness’ of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when labeling a language as low-resource.