2025
GAIfE: Using GenAI to Improve Literacy in Low-resourced Settings
Allahsera Auguste Tapo | Nouhoum Coulibaly | Seydou Diallo | Sebastien Diarra | Christopher M Homan | Mamadou K. Keita | Michael Leventhal
Findings of the Association for Computational Linguistics: NAACL 2025
Illiteracy is a predictor of many negative social and personal outcomes. Illiteracy rates are particularly high in countries with under-resourced languages, where few books exist that are suitable for children to learn to read from. We present GAIfE (Generative AI for Education), a toolchain and workflow developed through empirical methods that demonstrates how existing tools can be adapted to address low literacy for an under-resourced language. We used GAIfE (a play on the Bambara word for “book”) to construct materials for developing children’s reading competence in Bambara, the vehicular language of Mali. Our approach to the generation and post-generation editing of content skewed by the Global-North-centric bias of available LLMs enabled us to rapidly increase the Bambara content available online tenfold, while maintaining high standards of attractiveness (to sustain engagement), accurate representation of Malian culture and the physical and social environment, and language quality. Using our materials, pilot reading programs achieved a 67% reduction in the number of children unable to read Bambara. Our approach demonstrated the power of bias-aware application of generative AI to this problem domain, as well as the potential impact this technology could have on reducing illiteracy and improving learning outcomes through native-language education.
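For readers curious what a bias-aware generate-then-post-edit loop of the kind this abstract describes might look like, here is a minimal sketch in Python. The function names, prompt wording, and keyword screen are illustrative assumptions, not GAIfE's actual implementation.

# Illustrative sketch only: generate a draft, screen it for likely
# Global-North content, and route it to human post-editing.
# llm_generate, the prompt text, and BIAS_MARKERS are hypothetical.

BIAS_MARKERS = {"snow", "castle", "subway", "skyscraper"}  # assumed screen list

def llm_generate(prompt: str) -> str:
    """Placeholder: call whatever text-generation model is available."""
    raise NotImplementedError("plug in a model client here")

def draft_story(theme: str) -> str:
    # Constrain generation toward Malian settings up front.
    prompt = ("Write a short children's story set in Mali about "
              f"{theme}, using everyday Malian names, foods, and places.")
    return llm_generate(prompt)

def route_for_editing(story: str) -> str:
    # Drafts containing likely out-of-context content are flagged for a
    # deeper human post-edit; everything else still gets a human review.
    words = set(story.lower().split())
    return "deep-edit" if words & BIAS_MARKERS else "light-review"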
SMOL: Professionally Translated Parallel Data for 115 Under-represented Languages
Isaac Caswell | Elizabeth Nielsen | Jiaming Luo | Colin Cherry | Geza Kovacs | Hadar Shemtov | Partha Talukdar | Dinesh Tewari | Baba Mamadi Diane | Djibrila Diane | Solo Farabado Cissé | Koulako Moussa Doumbouya | Edoardo Ferrante | Alessandro Guasoni | Christopher Homan | Mamadou K. Keita | Sudhamoy DebBarma | Ali Kuzhuget | David Anugraha | Muhammad Ravi Shulthan Habibi | Sina Ahmadi | Anthony Munthali | Jonathan Mingfei Liu | Jonathan Eng
Proceedings of the Tenth Conference on Machine Translation
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock machine translation for low-resource languages (LRLs). SMOL has been translated into 123 under-resourced languages (125 language pairs), including many for which no previous public resources exist, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOLSENT, a set of sentences chosen for broad unique token coverage, and SMOLDOC, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph-, sentence-, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust chrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOLDOC, yielding the first factuality datasets for most of these languages.
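As a pointer for reproducing the kind of evaluation reported here, chrF can be computed with the standard sacrebleu package; a minimal sketch follows, where the hypothesis and reference strings are toy placeholders, not SMOL data.

# Compute chrF (character n-gram F-score) for MT output with sacrebleu.
# Install with: pip install sacrebleu
from sacrebleu.metrics import CHRF

chrf = CHRF()  # sacrebleu defaults: character 6-grams, beta = 2

# Toy placeholders; real use would pair a system's translations with
# SMOL's professionally translated references for one language pair.
hypotheses = ["the market opens early today", "she is reading a book"]
references = [["the market opens early today", "she reads a book"]]

print(chrf.corpus_score(hypotheses, references))  # e.g. "chrF2 = 91.2"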