2025
Amharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks
Dagnachew Mekonnen Marilign | Eyob Nigussie Alemu
Proceedings of the 9th Widening NLP Workshop
News classification is a downstream task in Natural Language Processing (NLP) that involves automatically categorizing news articles into predefined thematic categories. Although notable advances have been made for high-resource languages, low-resource languages such as Amharic still face significant challenges, largely due to the scarcity of annotated corpora and the limited availability of language-specific, state-of-the-art model adaptations. To address these limitations, this study substantially expands an existing Amharic news dataset from 50,000 to 144,000 articles, enriching the linguistic and topical diversity available for model training and evaluation. Using this expanded dataset, we systematically evaluated five transformer-based models (mBERT, XLM-R, DistilBERT, AfriBERTa, and AfroXLM) on Amharic news classification. Among these, AfriBERTa and XLM-R achieved the highest F1-scores, 90.25% and 90.11% respectively, establishing a new performance baseline for the task. These findings underscore the efficacy of advanced multilingual and Africa-centric transformer architectures when applied to under-resourced languages, and they emphasize the critical importance of large-scale, high-quality datasets in enabling robust model generalization. The study thus offers a strong empirical foundation for advancing NLP research in low-resource languages, which remain underrepresented in current NLP resources and methodologies.
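For readers who want to reproduce this kind of benchmark, the sketch below shows a minimal fine-tuning setup using the Hugging Face Trainer. It is an illustrative assumption, not the authors' released pipeline: the checkpoint name (castorini/afriberta_large as a stand-in for AfriBERTa), the CSV file names, the six-label count, and the weighted F1 averaging are all assumed rather than taken from the paper.

```python
# Minimal sketch: fine-tune a pretrained transformer for Amharic news topic
# classification and report F1. Checkpoint, file paths, label count, and
# F1 averaging are assumptions for illustration only.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "castorini/afriberta_large"  # assumed AfriBERTa checkpoint
NUM_LABELS = 6  # assumption: six topical categories

# Hypothetical CSV splits with "text" (article body) and "label" (int) columns.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Truncate long articles to the model's usual 512-token limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)

def compute_metrics(eval_pred):
    # Weighted F1 is assumed here; the paper does not specify the averaging.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="amharic-news-clf",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables default dynamic padding via DataCollatorWithPadding
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())  # prints eval loss and the weighted F1
```

Swapping in the other benchmarked models amounts to changing MODEL_NAME to the corresponding Hub checkpoint (for example, xlm-roberta-base for XLM-R or bert-base-multilingual-cased for mBERT); the rest of the setup stays the same.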