Oumaima Attafi




2025

Shawarma Chats: A Benchmark Exact Dialogue & Evaluation Platter in Egyptian, Maghrebi & Modern Standard Arabic—A Triple-Dialect Feast for Hungry Language Models
Kamyar Zeinalipour | Mohamed Zaky Saad | Oumaima Attafi | Marco Maggini | Marco Gori
Proceedings of The Third Arabic Natural Language Processing Conference

Content-grounded dialogue evaluation for Arabic remains under-resourced, particularly across the Modern Standard (MSA), Egyptian, and Maghrebi varieties. We introduce Shawarma Chats, a benchmark of 30,000 six-turn conversations grounded in Wikipedia content, evenly split across the three dialects. To build this corpus, we prompt five frontier LLMs (GPT-4o, Gemini 2.5 Flash, Qwen-Plus, DeepSeek-Chat, and Mistral Large) to generate 1,500 seed dialogues. Native Arabic speakers evaluate these outputs to select the most effective generator and the most human-aligned grader. Sub-A dialogues undergo a two-pass, rationale-driven self-repair loop in which the grader critiques and the generator revises; unresolved cases are manually corrected. We apply this pipeline to 10,000 Wikipedia paragraphs to create 30,000 high-quality conversations (10,000 per dialect) at modest human cost. To validate the benchmark, we LoRA-fine-tune six open LLMs (1–24 B parameters) on Shawarma Chats and observe consistent gains in automatic-grader scores, BERTScore, BLEU, and ROUGE, particularly for models larger than 7 B parameters. Shawarma Chats thus establishes the first large-scale, dialect-aware, content-grounded dialogue benchmark for Arabic.
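
The two-pass self-repair loop mentioned in the abstract could be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the names `Review`, `self_repair`, `grade_fn`, and `revise_fn` are hypothetical, the letter-grade scale is assumed from the phrase "Sub-A", and the toy grader and generator stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Review:
    """Hypothetical grader output: a letter grade plus a written rationale."""
    grade: str
    rationale: str


def self_repair(
    dialogue: str,
    grade_fn: Callable[[str], Review],
    revise_fn: Callable[[str, str], str],
    max_passes: int = 2,   # "two-pass" per the abstract
    target: str = "A",     # assumed passing grade
) -> Tuple[str, bool]:
    """Run up to max_passes critique/revise rounds.

    Returns the final dialogue and whether it reached the target grade.
    Dialogues returned with False would go to manual correction.
    """
    for _ in range(max_passes):
        review = grade_fn(dialogue)
        if review.grade == target:
            return dialogue, True
        # The generator revises using the grader's rationale.
        dialogue = revise_fn(dialogue, review.rationale)
    # Check the result of the last revision.
    return dialogue, grade_fn(dialogue).grade == target


# Toy stand-ins for the grader and generator (assumptions for illustration).
def toy_grade(dialogue: str) -> Review:
    if "fixed" in dialogue:
        return Review("A", "ok")
    return Review("B", "needs fixing")


def toy_revise(dialogue: str, rationale: str) -> str:
    return dialogue + " fixed"


repaired, resolved = self_repair("draft", toy_grade, toy_revise)
unchanged, unresolved_ok = self_repair("draft", toy_grade, lambda d, r: d)
```

In this sketch a dialogue that still fails after both passes is returned with `False`, which corresponds to the cases the abstract says are corrected manually.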