Mostafa Saeed

2025

Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models
Mostafa Saeed | Nizar Habash
Proceedings of The Third Arabic Natural Language Processing Conference

Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both without analyzers and with their integration. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios.

AMAR at BAREC Shared Task 2025: Arabic Meta-learner for Assessing Readability
Mostafa Saeed | Rana Waly | Abdelaziz Ashraf Hussein
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
Mostafa Saeed | Nizar Habash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.

Venues

arabicnlp (2)
emnlp (1)
