Muneera Alhoshan
2026
Saudi ASWAT: A Large-Scale Corpus of Spontaneous Saudi Arabic Speech
Abdullah I. Alharbi | Afrah A. Altamimi | Muneera Alhoshan | Amal Almazrua | Halah Munif Alharbi | Bayan M. Almuqhim | Hawra Aljasim | Abdulrahman Alosaimy | Yahya A. Asiri | Abdullah Alfaifi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Spontaneous Arabic speech is scarce and poorly represented in current corpora. This limits the visibility of spontaneous Arabic to automatic speech recognition (ASR), speaker diarization, and sociolinguistic research. The Saudi ASWAT project fills a major gap by creating the first nationwide corpus of natural Saudi speech, recorded and transcribed under a systematic methodology and ecologically valid conditions. The corpus aims to collect 2,500 hours of natural conversations from a diverse range of participants. These have been selected from five major Saudi regional varieties, Najdi (Central), Eastern, Hijazi (Western), Northern, and Southern, covering more than fifty-five local varieties. Speech has been recorded by trained fieldworkers using participants' own devices to reflect real-life variation. The annotated data incorporate a variety of speaker demographics, regional vocabularies which differ from the standard lexicon, and structured metadata. TF–IDF profiling shows regional differences in a range of performing words. The data also reflect balanced age and gender sampling to support studies of intergenerational and sociophonetic variation. Saudi ASWAT provides the most linguistically diverse resources of Saudi Arabia to date. Additionally, it establishes an ethically governed framework for Arabic speech data creation to enable advances in both computational modeling and linguistic research.
Mu’jam Arriyadh: A Comprehensive Lexicon for Contemporary Arabic Language
Afrah A. Altamimi | Abdulrahman Alosaimy | Halah Munif Alharbi | Hawra Aljasim | Muneera Alhoshan | Amal Almazrua | Hanan Alharbi | Abdulrahman Saeed Alshehri | Bayan M. Almuqhim | Maryam H. Algarny | Yahya A. Asiri | Abdullah I. Alharbi | Saleh Zaidan Albalawi | Fawziah Mohammed Asiri | Sara Ali Alhifthi | Abdullah Alfaifi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper provides an overview of the Contemporary Arabic Lexicon (Mu’jam Arriyadh). It is a contemporary and inclusive Arabic dictionary that has been specifically developed to cater to the needs of both native and non-native Arabic speakers. The corpus utilized in this study is derived from the Arabic Contemporary Corpus for Analysis (ACCA), which encompasses a vast collection of 450 million words of Modern Standard Arabic spanning the previous century. Significantly, the lexicon prioritizes lemma-based entries over root forms, enhancing its user-friendliness and adaptability across different contexts. The resource offers comprehensive linguistic data pertaining to a wide array of Arabic vocabulary, encompassing morphological, morpho-syntactic, and semantic aspects. The lexicon has been developed in accordance with the ISO 24613 standard, which improves its ability to be processed by machines and facilitates its use in natural language processing systems. The database encompasses a range of linguistic aspects, such as synonyms, antonyms, and root forms, offering a comprehensive compilation. Mu’jam Arriyadh is a contemporary Arabic lexicon that is designed to be accessible to users, compatible with machine processing, and highly beneficial for anyone studying the language, conducting research, or using natural language processing technologies.
2025
BALSAM: A Platform for Benchmarking Arabic Large Language Models
Rawan Al-Matham | Kareem Darwish | Raghad Al-Rasheed | Waad Alshammari | Muneera Alhoshan | Amal Almazrua | Asma Al Wazrah | Mais Alheraki | Firoj Alam | Preslav Nakov | Norah Alzahrani | Eman AlBilali | Nizar Habash | Abdelrahman El-Sheikh | Muhammad Elmallah | Haonan Li | Hamdy Mubarak | Mohamed Anwar | Zaid Alyafeai | Ahmed Abdelali | Nora Altwairesh | Maram Hasanain | Abdulmohsen Al Thubaity | Shady Shehata | Bashar Alhafni | Injy Hamed | Go Inoue | Khalid Elmadani | Ossama Obeid | Fatima Haouari | Tamer Elsayed | Emad Alghamdi | Khalid Almubarak | Saied Alshahrani | Ola Aljarrah | Safa Alajlan | Areej Alshaqarawi | Maryam Alshihri | Sultana Alghurabi | Atikah Alzeghayer | Afrah Altamimi | Abdullah Alfaifi | Abdulrahman AlOsaimy
Proceedings of The Third Arabic Natural Language Processing Conference
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
Evaluating RAG Pipelines for Arabic Lexical Information Retrieval: A Comparative Study of Embedding and Generation Models
Raghad Al-Rasheed | Abdullah Al Muaddi | Hawra Aljasim | Rawan Al-Matham | Muneera Alhoshan | Asma Al Wazrah | Abdulrahman AlOsaimy
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
This paper investigates the effectiveness of retrieval-augmented generation (RAG) pipelines, focusing on Arabic lexical information retrieval. Specifically, it analyzes how embedding models affect the recall of Arabic lexical information and evaluates the ability of large language models (LLMs) to produce accurate and contextually relevant answers within RAG pipelines. We examine a dataset of over 88,000 words from the Riyadh dictionary and evaluate the models using metrics such as Top-K Recall, Mean Reciprocal Rank (MRR), F1 Score, Cosine Similarity, and Accuracy. The research assesses the capabilities of several embedding models, including E5-large, BGE, AraBERT, CAMeLBERT, and AraELECTRA, highlighting a disparity in performance between sentence embeddings and word embeddings. Sentence embedding with E5 achieved the best results, with a Top-5 Recall of 0.88 and an MRR of 0.48. For the generation models, we evaluated GPT-4, GPT-3.5, SILMA-9B, Gemini-1.5, Aya-8B, and AceGPT-13B based on their ability to generate accurate and contextually appropriate responses. GPT-4 demonstrated the best performance, achieving an F1 score of 0.90, an accuracy of 0.82, and a cosine similarity of 0.87. Our results emphasize the strengths and limitations of both embedding and generation models in Arabic tasks.
2024
KSAA-CAD Shared Task: Contemporary Arabic Dictionary for Reverse Dictionary and Word Sense Disambiguation
Waad Alshammari | Amal Almazrua | Asma Al Wazrah | Rawan Almatham | Muneera Alhoshan | Abdulrahman Alosaimy
Proceedings of the Second Arabic Natural Language Processing Conference
This paper outlines the KSAA-CAD shared task, highlighting the Contemporary Arabic Language Dictionary within the scenario of developing a Reverse Dictionary (RD) system and enhancing Word Sense Disambiguation (WSD) capabilities. The first KSAA-RD (Al-Matham et al., 2023) highlighted significant gaps in the domain of RDs, which are designed to retrieve words by their meanings or definitions. This shared task comprises two tasks: RD and WSD. The RD task focuses on identifying word embeddings that most accurately match a given definition, termed a “gloss,” in Arabic. Conversely, the WSD task involves determining the specific meaning of a word in context, particularly when the word has multiple meanings. The winning team achieved the highest-ranking score of 0.0644 in RD using Electra embeddings. In this paper, we describe the methods employed by the participating teams and provide insights into the future direction of KSAA-CAD.
Co-authors
- Abdulrahman AlOsaimy 5
- Amal Almazrua 4
- Asma Al Wazrah 3
- Abdullah Alfaifi 3
- Hawra Aljasim 3
- Rawan Al-Matham 2
- Raghad Al-Rasheed 2
- Abdullah I. Alharbi 2
- Halah Munif Alharbi 2
- Bayan M. Almuqhim 2
- Waad Thuwaini Alshammari 2
- Afrah A. Altamimi 2
- Yahya A. Asiri 2
- Ahmed Abdelali 1
- Abdullah Al Muaddi 1
- Abdulmohsen Al-Thubaity 1
- Safa Alajlan 1
- Firoj Alam 1
- Saleh Zaidan Albalawi 1
- Eman Albilali 1
- Maryam H. Algarny 1
- Emad Alghamdi 1
- Sultana Alghurabi 1
- Bashar Alhafni 1
- Hanan Alharbi 1
- Mais Alheraki 1
- Sara Ali Alhifthi 1
- Ola Aljarrah 1
- Rawan Almatham 1
- Khalid Almubarak 1
- Saied Alshahrani 1
- Areej Alshaqarawi 1
- Abdulrahman Saeed Alshehri 1
- Maryam Alshihri 1
- Afrah Altamimi 1
- Nora Altwairesh 1
- Zaid Alyafeai 1
- Norah A. Alzahrani 1
- Atikah Alzeghayer 1
- Mohamed Anwar 1
- Fawziah Mohammed Asiri 1
- Kareem Darwish 1
- Abdelrahman El-Sheikh 1
- Khalid Elmadani 1
- Muhammad Elmallah 1
- Tamer Elsayed 1
- Nizar Habash 1
- Injy Hamed 1
- Fatima Haouari 1
- Maram Hasanain 1
- Go Inoue 1
- Haonan Li 1
- Hamdy Mubarak 1
- Preslav Nakov 1
- Ossama Obeid 1
- Shady Shehata 1