Nombuyiselo Zondi




2025

AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages
Kayode Olaleye | Arturo Oncevay | Mathieu Sibue | Nombuyiselo Zondi | Michelle Terblanche | Sibongile Mapikitla | Richard Lastrucci | Charese Smiley | Vukosi Marivate
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Code-switching is prevalent in multilingual communities but lacks adequate high-quality data for model development, especially for African languages. To address this, we present AfroCS-xs, a small human-validated synthetic code-switched dataset for four African languages (Afrikaans, Sesotho, Yoruba, isiZulu) and English within a specific domain—agriculture. Using large language models (LLMs), we generate code-switched sentences, including English translations, that are rigorously validated and corrected by native speakers. As a downstream evaluation task, we use this dataset to fine-tune different instruction-tuned LLMs for code-switched translation and compare their performance against machine translation (MT) models. Our results demonstrate that LLMs consistently improve in translation accuracy when fine-tuned on the high-quality AfroCS-xs dataset, highlighting that substantial gains can still be made with a low volume of data. We also observe improvements on natural code-switched and out-of-domain (personal finance) test sets. Overall, regardless of data size and prior exposure to a language, LLMs benefit from higher quality training data when translating code-switched texts in under-represented languages.