Halah Munif Alharbi
2026
Saudi ASWAT: A Large-Scale Corpus of Spontaneous Saudi Arabic Speech
Abdullah I. Alharbi | Afrah A. Altamimi | Muneera Alhoshan | Amal Almazrua | Halah Munif Alharbi | Bayan M. Almuqhim | Hawra Aljasim | Abdulrahman Alosaimy | Yahya A. Asiri | Abdullah Alfaifi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Spontaneous Arabic speech is scarce and poorly represented in current corpora. This limits the visibility of spontaneous Arabic in automatic speech recognition (ASR), speaker diarization, and sociolinguistic research. The Saudi ASWAT project fills a major gap by creating the first nationwide corpus of natural Saudi speech, recorded and transcribed under a systematic methodology and in ecologically valid conditions. The corpus aims to collect 2,500 hours of natural conversations from a diverse range of participants. Participants have been selected from five major Saudi regional varieties, Najdi (Central), Eastern, Hijazi (Western), Northern, and Southern, covering more than fifty-five local varieties. Speech has been recorded by trained fieldworkers using participants' own devices to reflect real-life variation. The annotated data incorporate a variety of speaker demographics, regional vocabularies that differ from the standard lexicon, and structured metadata. TF–IDF profiling shows regional differences in the top-ranked words. The data also reflect balanced age and gender sampling to support studies of intergenerational and sociophonetic variation. Saudi ASWAT provides the most linguistically diverse speech resource for Saudi Arabia to date. Additionally, it establishes an ethically governed framework for Arabic speech data creation to enable advances in both computational modeling and linguistic research.
Mu’jam Arriyadh: A Comprehensive Lexicon for Contemporary Arabic Language
Afrah A. Altamimi | Abdulrahman Alosaimy | Halah Munif Alharbi | Hawra Aljasim | Muneera Alhoshan | Amal Almazrua | Hanan Alharbi | Abdulrahman Saeed Alshehri | Bayan M. Almuqhim | Maryam H. Algarny | Yahya A. Asiri | Abdullah I. Alharbi | Saleh Zaidan Albalawi | Fawziah Mohammed Asiri | Sara Ali Alhifthi | Abdullah Alfaifi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper provides an overview of the Contemporary Arabic Lexicon (Mu’jam Arriyadh), a contemporary and inclusive Arabic dictionary developed to serve both native and non-native Arabic speakers. The underlying corpus is derived from the Arabic Contemporary Corpus for Analysis (ACCA), which comprises 450 million words of Modern Standard Arabic spanning the previous century. Notably, the lexicon prioritizes lemma-based entries over root forms, which makes it more user-friendly and adaptable across contexts. The resource offers comprehensive linguistic data on a wide array of Arabic vocabulary, covering morphological, morphosyntactic, and semantic aspects. The lexicon has been developed in accordance with the ISO 24613 standard, which improves its machine-processability and facilitates its use in natural language processing systems. The database also records a range of linguistic relations, such as synonyms, antonyms, and root forms. Mu’jam Arriyadh is thus designed to be accessible to users, compatible with machine processing, and useful for anyone studying the language, conducting research, or using natural language processing technologies.