Saudi ASWAT: A Large-Scale Corpus of Spontaneous Saudi Arabic Speech

Abdullah I. Alharbi, Afrah A. Altamimi, Muneera Alhoshan, Amal Almazrua, Halah Munif Alharbi, Bayan M. Almuqhim, Hawra Aljasim, Abdulrahman Alosaimy, Yahya A. Asiri, Abdullah Alfaifi


Abstract
Spontaneous Arabic speech is scarce and poorly represented in current corpora. This limits the visibility of spontaneous Arabic to automatic speech recognition (ASR), speaker diarization, and sociolinguistic research. The Saudi ASWAT project fills a major gap by creating the first nationwide corpus of natural Saudi speech, recorded and transcribed under a systematic methodology and in ecologically valid conditions. The corpus aims to collect 2,500 hours of natural conversations from a diverse range of participants. Participants have been selected from five major Saudi regional varieties, Najdi (Central), Eastern, Hijazi (Western), Northern, and Southern, covering more than fifty-five local varieties. Speech has been recorded by trained fieldworkers using participants' own devices to reflect real-life variation. The annotated data incorporate a variety of speaker demographics, regional vocabularies that differ from the standard lexicon, and structured metadata. TF–IDF profiling shows regional differences in a range of high-scoring words. The data also reflect balanced age and gender sampling to support studies of intergenerational and sociophonetic variation. Saudi ASWAT provides the most linguistically diverse resource for Saudi Arabia to date. Additionally, it establishes an ethically governed framework for Arabic speech data creation to enable advances in both computational modeling and linguistic research.
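The abstract mentions TF–IDF profiling of regional vocabulary but does not describe the procedure. The following is a minimal sketch of one common way to do this, not the authors' implementation: it assumes each region's pooled transcript tokens are treated as a single document, uses relative term frequency with an unsmoothed logarithmic IDF, and runs on made-up placeholder tokens. The function names `tfidf_by_region` and `top_words` are hypothetical.

```python
import math
from collections import Counter


def tfidf_by_region(region_tokens):
    """Score words per region, treating each region's pooled tokens as one document.

    Assumption: TF is relative frequency within the region and
    IDF is log(N / df) with no smoothing, where N is the number of regions.
    """
    n_regions = len(region_tokens)

    # Document frequency: in how many regions does each word occur?
    df = Counter()
    for tokens in region_tokens.values():
        df.update(set(tokens))

    scores = {}
    for region, tokens in region_tokens.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        scores[region] = {
            word: (count / total) * math.log(n_regions / df[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, region, k=3):
    """Return the k highest-scoring words for a region (ties broken alphabetically)."""
    ranked = sorted(scores[region].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Placeholder tokens only; these are not real corpus data or dialect words.
toy = {
    "Najdi": ["shared", "shared", "najdi_a", "najdi_a", "najdi_b"],
    "Hijazi": ["shared", "hijazi_a", "hijazi_a", "hijazi_a"],
    "Southern": ["shared", "southern_a", "najdi_b"],
}

scores = tfidf_by_region(toy)
for region in toy:
    print(region, top_words(scores, region, k=2))
```

A word that occurs in every region gets an IDF of zero under this scheme, so only region-specific vocabulary rises to the top of each list. Real profiling would also need tokenization and normalization choices suited to dialectal Arabic, which this sketch leaves out.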
Anthology ID:
2026.lrec-main.124
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
1595–1602
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.124/
Cite (ACL):
Abdullah I. Alharbi, Afrah A. Altamimi, Muneera Alhoshan, Amal Almazrua, Halah Munif Alharbi, Bayan M. Almuqhim, Hawra Aljasim, Abdulrahman Alosaimy, Yahya A. Asiri, and Abdullah Alfaifi. 2026. Saudi ASWAT: A Large-Scale Corpus of Spontaneous Saudi Arabic Speech. International Conference on Language Resources and Evaluation, main:1595–1602.
Cite (Informal):
Saudi ASWAT: A Large-Scale Corpus of Spontaneous Saudi Arabic Speech (Alharbi et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.124.pdf