Marwan Sayed
2026
LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring
May Bashendy | Walid Massoud | Sohaila Eltanbouly | Salam Albatarni | Marwan Sayed | Abrar Abir | Houda Bouamor | Tamer Elsayed
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.
Is One Dataset Enough for Evaluation? Studying Generalizability of Automated Essay Scoring Models
Sohaila Eltanbouly | Marwan Sayed | Tamer Elsayed
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Automated Essay Scoring (AES) has made significant advancements in writing assessment. Recently, cross-prompt AES has gained attention because of its focus on generalizing to unseen prompts. Despite the promise of these advancements, a critical question remains: how generalizable and robust are those models when applied to diverse datasets? This study assesses the generalizability of eight cross-prompt AES models across three different datasets. We employ two experimental setups: the within-dataset approach, where both training and testing occur on the same dataset, and the cross-dataset approach, which challenges the models by evaluating their performance on previously unseen datasets. The experimental results show significant performance inconsistencies, highlighting that relying on a single dataset is insufficient for building robust and generalizable AES systems.
2025
Feature Engineering is not Dead: A Step Towards State of the Art for Arabic Automated Essay Scoring
Marwan Sayed | Sohaila Eltanbouly | May Bashendy | Tamer Elsayed
Proceedings of The Third Arabic Natural Language Processing Conference
Automated Essay Scoring (AES) has shown significant advancements in educational assessment. However, under-resourced languages like Arabic have received limited attention. To bridge this gap and enable robust Arabic AES, this paper introduces the first publicly available comprehensive set of engineered features tailored for Arabic AES, covering surface-level, readability, lexical, syntactic, and semantic features. Experiments are conducted on a dataset of 620 Arabic essays, each annotated with both holistic and trait-specific scores. Our findings demonstrate that the proposed feature set is effective across different models and competitive with recent NLP advances, including LLMs, establishing state-of-the-art performance and providing strong baselines for future Arabic AES research. Moreover, the resulting feature set offers a reusable and foundational resource, contributing to the development of more effective Arabic AES systems.