Yacouba Diarra
2026
Dealing with the Hard Facts of Low-Resource African NLP
Michael Leventhal | Yacouba Diarra | Nouhoum Coulibaly | Panga Azazia Kamaté | Aymane Dembélé | Madani Amadou Tall | Emmanuel Elise Kone
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
Michael Leventhal | Yacouba Diarra | Nouhoum Coulibaly | Panga Azazia Kamaté
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Seydou Diallo | Yacouba Diarra | Panga Azazia Kamaté | Aboubacar Ouattara | Mamadou K. Keita | Adam Bouno Kampo
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, using one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards: the best Word Error Rate (WER) was 46.76% and the best Character Error Rate (CER), achieved by a different model, was 13.00%, while several prominent multilingual models exceeded 100% WER due to severe hallucinations. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures likely establish an upper bound for performance in practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
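A WER above 100% can look like a reporting error, so a minimal sketch may help. It assumes the conventional definition of WER as word-level edit distance divided by the number of reference words, and CER as the same ratio over characters; the abstract does not say which scoring tool or text normalization the benchmark uses, and the function names and placeholder tokens below are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder tokens, not real Bambara: a two-word reference and a
# five-word hallucinated hypothesis with no matching words.
print(wer("a b", "x y z w v"))  # 2 substitutions + 3 insertions over 2 words
```

Because the denominator counts only reference words, a hypothesis that inserts many spurious words can accumulate more edits than there are words in the reference, which is how a hallucinating model ends up above 100%.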
2023
Findings from the Bambara - French Machine Translation Competition (BFMT 2023)
Ninoh Agostinho Da Silva | Tunde Oluwaseyi Ajayi | Alexander Antonov | Panga Azazia Kamate | Moussa Coulibaly | Mason Del Rio | Yacouba Diarra | Sebastian Diarra | Chris Emezue | Joel Hamilcaro | Christopher M. Homan | Alexander Most | Joseph Mwatukange | Peter Ohue | Michael Pham | Abdoulaye Sako | Sokhar Samb | Yaya Sy | Tharindu Cyril Weerasooriya | Yacine Zahidi | Sarah Luger
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Orange Silicon Valley hosted a low-resource machine translation (MT) competition with monetary prizes. The goals of the competition were to raise awareness of the challenges in the low-resource MT domain, improve MT algorithms and data strategies, and support MT expertise development in the regions where people speak Bambara and other low-resource languages. The participants built Bambara to French and French to Bambara machine translation systems using data provided by the organizers and additional data resources shared amongst the competitors. This paper details each team’s different approaches and motivation for ongoing work in Bambara and the broader low-resource machine translation domain.
Co-authors
- Panga Azazia Kamaté 4
- Nouhoum Coulibaly 2
- Michael Leventhal 2
- Ninoh Agostinho Da Silva 1
- Tunde Oluwaseyi Ajayi 1
- Alexander Antonov 1
- Moussa Coulibaly 1
- Mason Del Rio 1
- Aymane Dembélé 1
- Seydou Diallo 1
- Sebastian Diarra 1
- Chris Chinenye Emezue 1
- Joel Hamilcaro 1
- Christopher M. Homan 1
- Adam Bouno Kampo 1
- Mamadou K. Keita 1
- Emmanuel Elise Kone 1
- Sarah Luger 1
- Alexander Most 1
- Joseph Mwatukange 1
- Peter Ohue 1
- Aboubacar Ouattara 1
- Michael Pham 1
- Abdoulaye Sako 1
- Sokhar Samb 1
- Yaya Sy 1
- Madani Amadou Tall 1
- Tharindu Cyril Weerasooriya 1
- Yacine Zahidi 1