Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform
Abdifatah Ahmed Gedi, Shafie Abdi Mohamed, Yusuf A. Yusuf, Muhidin A. Mohamed, Fuad Mire Hassan, Houssein A Assowe
Abstract
Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine-learning-based language models. However, a key research challenge for low-resource languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali's agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. To validate the data, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus covering news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
- Anthology ID:
- 2026.africanlp-main.17
- Volume:
- Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Everlyn Asiko Chimoto, Constantine Lignos, Shamsuddeen Muhammad, Idris Abdulmumin, Clemencia Siro, David Ifeoluwa Adelani
- Venues:
- AfricaNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 179–189
- URL:
- https://preview.aclanthology.org/manual-author-scripts/2026.africanlp-main.17/
- Cite (ACL):
- Abdifatah Ahmed Gedi, Shafie Abdi Mohamed, Yusuf A. Yusuf, Muhidin A. Mohamed, Fuad Mire Hassan, and Houssein A Assowe. 2026. Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform. In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), pages 179–189, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform (Gedi et al., AfricaNLP 2026)
- PDF:
- https://preview.aclanthology.org/manual-author-scripts/2026.africanlp-main.17.pdf
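
The abstract describes a pilot lemmatizer that looks words up in the lexicon and falls back to rule-based logic for out-of-vocabulary terms. The Python sketch below illustrates that general pattern only and is not the authors' implementation. Everything in it is an assumption made for illustration: the three toy lexicon entries, the suffix list, the minimum stem length, and the names `LEXICON`, `SUFFIXES`, `MIN_STEM`, `lemmatize`, and `accuracy` do not come from the paper.

```python
from typing import Dict, Iterable, Tuple

# Toy lexicon: surface form -> (lemma, POS tag).
# Illustrative entries only; NOT taken from the paper's lexicon.
LEXICON: Dict[str, Tuple[str, str]] = {
    "buug": ("buug", "NOUN"),
    "buugga": ("buug", "NOUN"),
    "guriga": ("guri", "NOUN"),
}

# Hypothetical suffix-stripping rules for out-of-vocabulary tokens.
# A real rule set would be far richer and morphologically informed.
SUFFIXES = ("ka", "ga", "ta", "da")
MIN_STEM = 3  # assumed guard against stripping very short words


def lemmatize(token: str) -> Tuple[str, str, str]:
    """Return (lemma, pos, source), where source records which path fired."""
    word = token.lower()

    # 1. Lexicon lookup: preferred whenever the surface form is known.
    if word in LEXICON:
        lemma, pos = LEXICON[word]
        return lemma, pos, "lexicon"

    # 2. Rule-based fallback: strip the longest matching suffix.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem, "UNK", "rule"

    # 3. No rule applies: return the token unchanged.
    return word, "UNK", "identity"


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (token, gold_lemma) pairs whose predicted lemma matches."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for token, gold in pairs if lemmatize(token)[0] == gold)
    return correct / len(pairs)
```

Returning the `source` tag alongside the lemma makes it possible to separate lexicon hits from rule-based guesses when inspecting errors, which could help show whether low accuracy stems from lexicon coverage or from the fallback rules.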