@inproceedings{nigusie-2025-supervised,
  title     = {Supervised Machine Learning based {Amharic} Text Complexity Classification Using Automatic Annotator Tool},
  author    = {Nigusie, Gebregziabihier},
  editor    = {Lignos, Constantine and
               Abdulmumin, Idris and
               Adelani, David},
  booktitle = {Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)},
  month     = jul,
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.africanlp-1.2/},
  pages     = {7--14},
  isbn      = {979-8-89176-257-2},
  abstract  = {Understanding written content can vary significantly based on the linguistic complexity of the text. In the context of Amharic, a morphologically rich and low-resource language, the use of complex vocabulary and less frequent expressions often hinders understanding, particularly among readers with limited literacy skills. Such complexity poses challenges for both human comprehension and NLP applications. Addressing this complexity in Amharic is therefore important for text readability and accessibility. In this study, we developed a text complexity annotation tool using curated list of 1,113 complex Amharic terms. Utilizing this tool, we collected and annotated a dataset comprising 20,000 sentences. Based on the annotated corpus, we developed a text complexity classification model using both traditional and deep learning approaches. For traditional machine learning models, the dataset was vectorized using the Bag-of-Words representation. For deep learning and pre-trained models, we implemented embedding layers based on Word2Vec and BERT, trained on a vocabulary consisting of 24,148 tokens. The experiment is conducted using Support Vector Machine and Random Forest for classical machine learning, and Long Short-Term Memory, Bidirectional LSTM, and BERT for deep learning and pre-trained models. The classification accuracies achieved were 83.5{\%} for SVM, 80.3{\%} for RF, 84.1{\%} for LSTM, 85.0{\%} for BiLSTM, and 89.4{\%} for the BERT-based model. Among these, the BERT-based approaches show optimal performance for text complexity classifications which have ability to capture long-range dependencies and contextual relationships within the text.},
}
@comment{
Markdown (Informal)
[Supervised Machine Learning based Amharic Text Complexity Classification Using Automatic Annotator Tool](https://aclanthology.org/2025.africanlp-1.2/) (Nigusie, AfricaNLP 2025)
ACL
}