Mostafa Shahin
Also published as: M. Shahin
2026
AusKidTalk: Developing Transcription Guidelines for Continuous Australian English Child Speech
Tuende Szalay | Zheng Nan | Renata Huang | Mostafa Shahin | Sirojan Tharmakulasingam | Kirrie Ballard | Beena Ahmed
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Guidelines are required for accurate and consistent transcription of speech corpora, especially when they contain more challenging speech, e.g., spontaneous or under-resourced speech. This paper presents a workflow and guidelines for transcribing spontaneous and under-resourced child speech in AusKidTalk, the first Australian English child corpus. Speech samples were elicited using a story-telling task and are 3.5 minutes long per child on average. Orthographic transcriptions were generated using automatic speech recognition (ASR) tools and corrected manually. A novel hand-correction protocol, consisting of guidelines, a hand-correction interface, and ground truth transcriptions, was developed together with consistency metrics. Nine annotators submitted hand-corrections for the story-telling tasks of 261 children and for 25 ground truth tasks. Manual correction took about 11 times the speech duration, with a 3.5-minute-long story-telling task corrected in approximately 40 minutes. This efficiency is attributed to the quality of the automatic transcription, which had a 23% word error rate. Manual correction was accurate, with annotators achieving consistent results on 15 of 25 ground truth submissions. Most inconsistent ground truth submissions were caused by a single, challenging ground truth task. These results show that our workflow yields efficient and accurate transcriptions, although transcriptions of potentially more challenging narrative tasks (e.g., elicited from younger children) might require further corrections.
2025
Iqra’Eval: A Shared Task on Qur’anic Pronunciation Assessment
Yassine El Kheir | Amit Meghanani | Hawau Olamide Toyin | Nada Almarwani | Omnia Ibrahim | Yousseif Ahmed Elshahawy | Mostafa Shahin | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
2006
Building Annotated Written and Spoken Arabic LRs in NEMLAR Project
M. Yaseen | M. Attia | B. Maegaard | K. Choukri | N. Paulsson | S. Haamid | S. Krauwer | C. Bendahman | H. Fersøe | M. Rashwan | B. Haddad | C. Mukbel | A. Mouradi | A. Al-Kufaishi | M. Shahin | N. Chenfour | A. Ragheb
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The NEMLAR project (Network for Euro-Mediterranean LAnguage Resource and human language technology development and support; www.nemlar.org) was a project supported by the EC with partners from Europe and Arabic countries, whose objective was to build a network of specialized partners to promote and support the development of Arabic Language Resources (LRs) in the Mediterranean region. The project focused on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, establishing a protocol for developing and identifying a Basic Language Resource Kit (BLARK) for Arabic, and assessing first-priority requirements. The BLARK is defined as the minimal set of language resources necessary for any pre-competitive research and education, as well as for the development of crucial components for any future NLP industry. Following the identification of high-priority resources, the NEMLAR partners agreed to focus on and produce three main resources: 1) an annotated Arabic written corpus of about 500 K words, 2) an Arabic speech corpus for TTS applications of 2x5 hours, and 3) an Arabic broadcast news speech corpus of 40 hours of Modern Standard Arabic. For each of the three LRs, the underlying linguistic models and assumptions of the corpus, technical specifications, methodologies for the collection and building of the resources, and validation and verification mechanisms were established and applied.
Co-authors
- Beena Ahmed 1
- Adil Al-Kufaishi 1
- Ahmed Ali 1
- Nada Almarwani 1
- Mohamed Attia 1
- Kirrie Ballard 1
- Chomicha Bendahman 1
- Noureddine Chenfour 1
- Khalid Choukri 1
- Yassine El Kheir 1
- Yousseif Ahmed Elshahawy 1
- Hanne Fersøe 1
- Salah Haamid 1
- Bassam Haddad 1
- Renata Huang 1
- Omnia Ibrahim 1
- Steven Krauwer 1
- Bente Maegaard 1
- Amit Meghanani 1
- Abdelhak Mouradi 1
- Chafic Mukbel 1
- Zheng Nan 1
- Niklas Paulsson 1
- Ahmed Ragheb 1
- Mohsen Rashwan 1
- Tuende Szalay 1
- Sirojan Tharmakulasingam 1
- Hawau Olamide Toyin 1
- Mustafa Yaseen 1