Wafa Al Ghallabi

Also published as: Wafa Al Ghallabi


2025

pdf bib
Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
Wafa Al Ghallabi | Ritesh Thawkar | Sara Ghaboura | Ketan Pravin More | Omkar Thawakar | Hisham Cholakkal | Salman Khan | Rao Muhammad Anwer
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce “Fann or Flop”, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator for testing how good the LLM is in understanding classical Arabic through the Arabic poetry. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release “Fann or Flop” along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic-capable language models.

pdf bib
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Sara Ghaboura | Ketan Pravin More | Ritesh Thawkar | Wafa Al Ghallabi | Omkar Thawakar | Fahad Shahbaz Khan | Hisham Cholakkal | Salman Khan | Rao Muhammad Anwer
Findings of the Association for Computational Linguistics: ACL 2025

Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models’ capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. We release the TimeTravel dataset and evaluation suite as open-source resources for culturally and historically informed research.