2025
ADOR: Dataset for Arabic Dialects in Hotel Reviews: A Human Benchmark for Sentiment Analysis
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Arabic machine translation remains a fundamentally challenging task, primarily due to the lack of comprehensive annotated resources. This study evaluates the performance of Meta’s NLLB-200 model in translating Modern Standard Arabic (MSA) into three regional dialects, Saudi, Maghribi, and Egyptian Arabic, using a manually curated dataset of hotel reviews. We applied a multi-criteria human annotation framework to assess translation correctness, dialect accuracy, and sentiment and aspect preservation. Our analysis reveals significant variation in translation quality across dialects. While sentiment and aspect preservation were generally high, dialect accuracy and overall translation fidelity were inconsistent. For Saudi Arabic, over 95% of translations required human correction, highlighting systemic issues. Maghribi outputs demonstrated better dialectal authenticity, while Egyptian translations achieved the highest reliability with the lowest correction rate and fewest multi-criteria failures. These results underscore the limitations of current multilingual models in handling informal Arabic varieties and highlight the importance of dialect-sensitive evaluation.
Evaluating Large Language Models on Sentiment Analysis in Arabic Dialects
Maram I. Alharbi | Saad Ezzini | Hansi Hettiarachchi | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Despite recent progress in large language models (LLMs), their performance on Arabic dialects remains underexplored, particularly in the context of sentiment analysis. This study presents a comparative evaluation of three LLMs, DeepSeek-R1, Qwen2.5, and LLaMA-3, on sentiment classification across Modern Standard Arabic (MSA), Saudi dialect and Darija. We construct a balanced sentiment dataset by translating and validating MSA hotel reviews into Saudi dialect and Darija. Using parameter-efficient fine-tuning (LoRA) and dialect-specific prompts, we assess each model under matched and mismatched prompting conditions. Evaluation results show that Qwen2.5 achieves the highest macro F1 score of 79% on Darija input using MSA prompts, while DeepSeek performs best when prompted in the input dialect, reaching 71% on Saudi dialect. LLaMA-3 exhibits stable performance across prompt variations, with 75% macro F1 on Darija input under MSA prompting. Dialect-aware prompting consistently improves classification accuracy, particularly for neutral and negative sentiment classes.
AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects
Maram I. Alharbi | Salmane Chafik | Saad Ezzini | Ruslan Mitkov | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects
The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.