Abraham Lin
2026
The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
Sheriff Issaka | Keyi Wang | Yinka Ajibola | Oluwatumininu Samuel-Ipaye | Zhaoyi Zhang | Nicte Aguillon Jimenez | Evans Kofi Agyei | Abraham Lin | Rohan Ramachandran | Sadick Abdul Mumin | Faith Nchifor | Mohammed Shuraim Issah | Erick Rosas Gonzalez | Lieqi Liu | Sylvester Kpei | Jemimah Kusi Osei | Carlene Ajeneza | Persis Boateng | Prisca Adwoa Dufie Yeboah | Saadia Gabriel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite representing nearly one-third of the world’s languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and empirical analysis. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion text tokens and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that even modest-scale models, when fine-tuned on targeted language data, achieve substantial improvements over untrained baselines, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a comparative analysis against Google Translate in which a 1B-parameter model matched or surpassed the commercial system in several languages including Yoruba and Twi, revealing that data scarcity, rather than model scale, constitutes the primary bottleneck for low-resource NLP, and suggesting that systematic dataset development yields disproportionate returns for low-resource languages.