Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS

Rania Al-Sabbagh

Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS

Abstract

Ramsa is a developing 41-hour speech corpus of Emirati Arabic designed to support sociolinguistic research and low-resource language technologies. It contains recordings from structured interviews with native speakers and episodes from national television shows. The corpus features 157 speakers (59 female, 98 male), spans subdialects such as Urban, Bedouin, and Mountain/Shihhi, and covers topics such as cultural heritage, agriculture and sustainability, daily life, professional trajectories, and architecture. It consists of 91 monologic and 79 dialogic recordings, varying in length and recording conditions. A 10% subset was used to evaluate commercial and open-source models for automatic speech recognition (ASR) and text-to-speech (TTS) in a zero-shot setting to establish initial baselines. Whisper-large-v3-turbo achieved the best ASR performance, with average word and character error rates of 0.268 and 0.144, respectively. MMS-TTS-Ara reported the best mean word and character rates of 0.285 and 0.081, respectively, for TTS. These baselines are competitive but leave substantial room for improvement. The paper highlights the challenges encountered and provides directions for future work.

Anthology ID:: 2026.lrec-main.250
Volume:: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:: May
Year:: 2026
Address:: Palma de Mallorca, Spain
Editors:: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:: LREC
SIG:
Publisher:: ELRA Language Resource Association
Note:
Pages:: 3184–3198
Language:
URL:: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.250/
DOI:
Bibkey:
Cite (ACL):: Rania Al-Sabbagh. 2026. Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS. International Conference on Language Resources and Evaluation, main:3184–3198.
Cite (Informal):: Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS (Al-Sabbagh, LREC 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.250.pdf

PDF Cite Search Fix data