Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles
Azzedine Aftiss, Salima Lamsiyah, Christoph Schommer, Said Ouatik El Alaoui
Abstract
Moroccan Dialect (MD), or “Darija,” is a primary spoken variant of Arabic in Morocco, yet it remains underrepresented in Natural Language Processing (NLP) research, particularly in tasks like summarization. Despite a growing volume of MD textual data online, there is a lack of robust resources and NLP models tailored to handle the unique linguistic challenges posed by MD. In response, we introduce GOUD.MA_v2, an expanded version of the GOUD.MA dataset, containing over 50k articles with their titles across 11 categories. This dataset provides a more comprehensive resource for developing summarization models. We evaluate the application of large language models (LLMs) for MD summarization, utilizing both fine-tuning and zero-shot prompting with encoder-decoder and causal LLMs, respectively. Our findings demonstrate that an expanded dataset improves summarization performance and highlight the capabilities of recent LLMs in handling MD text. We open-source our dataset, fine-tuned models, and all experimental code, establishing a foundation for future advancements in MD NLP. We release the code at https://github.com/AzzedineAftiss/Moroccan-Dialect-Summarization.
- Anthology ID:
- 2025.wacl-1.9
- Volume:
- Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Saad Ezzini, Hamza Alami, Ismail Berrada, Abdessamad Benlahbib, Abdelkader El Mahdaouy, Salima Lamsiyah, Hatim Derrouz, Amal Haddad Haddad, Mustafa Jarrar, Mo El-Haj, Ruslan Mitkov, Paul Rayson
- Venues:
- WACL | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 77–85
- URL:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2025.wacl-1.9/
- Cite (ACL):
- Azzedine Aftiss, Salima Lamsiyah, Christoph Schommer, and Said Ouatik El Alaoui. 2025. Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4), pages 77–85, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles (Aftiss et al., WACL 2025)
- PDF:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2025.wacl-1.9.pdf