Richard Lastrucci
2025
AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages
Kayode Olaleye | Arturo Oncevay | Mathieu Sibue | Nombuyiselo Zondi | Michelle Terblanche | Sibongile Mapikitla | Richard Lastrucci | Charese Smiley | Vukosi Marivate
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code-switching is prevalent in multilingual communities but lacks adequate high-quality data for model development, especially for African languages. To address this, we present AfroCS-xs, a small human-validated synthetic code-switched dataset for four African languages (Afrikaans, Sesotho, Yoruba, isiZulu) and English within a specific domain—agriculture. Using large language models (LLMs), we generate code-switched sentences, including English translations, that are rigorously validated and corrected by native speakers. As a downstream evaluation task, we use this dataset to fine-tune different instruction-tuned LLMs for code-switched translation and compare their performance against machine translation (MT) models. Our results demonstrate that LLMs consistently improve in translation accuracy when fine-tuned on the high-quality AfroCS-xs dataset, highlighting that substantial gains can still be made with a low volume of data. We also observe improvements on natural code-switched and out-of-domain (personal finance) test sets. Overall, regardless of data size and prior exposure to a language, LLMs benefit from higher quality training data when translating code-switched texts in under-represented languages.
2023
Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora
Richard Lastrucci | Jenalea Rajab | Matimba Shingange | Daniel Njini | Vukosi Marivate
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
This paper introduces two multilingual government-themed corpora in various South African languages. The corpora were collected by gathering South African government speeches (ZA-gov-multilingual), as well as the South African Government newspaper (Vuk’uzenzele), that are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corpora were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper, we highlight the process of gathering, cleaning and making available the corpora. We create parallel sentence corpora for Neural Machine Translation tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.
Co-authors
- Vukosi Marivate 2
- Sibongile Mapikitla 1
- Daniel Njini 1
- Kayode Olaleye 1
- Arturo Oncevay 1