Samrómur Milljón: An ASR Corpus of One Million Verified Read Prompts in Icelandic

Carlos Daniel Hernandez Mena, Þorsteinn Daði Gunnarsson, Jon Gudnason


Abstract
The platform samromur.is, or “Samrómur” for short, is a crowdsourcing web application built on Mozilla’s Common Voice, designed to accumulate speech data for the advancement of language technologies in Icelandic. Over the years, Samrómur has proven to be remarkably successful in amassing a significant number of high-quality audio clips from thousands of users. However, the challenge of manually verifying the entirety of the collected data has hindered its effective exploitation, especially in the realm of Automatic Speech Recognition (ASR), its original purpose. In this paper, we introduce the “Samrómur Milljón” corpus, an ASR dataset comprising one million audio clips from Samrómur. These clips have been automatically verified using state-of-the-art speech recognition systems such as NeMo, Wav2Vec2, and Whisper. Additionally, we present the ASR results obtained from creating acoustic models based on Samrómur Milljón. These results demonstrate significant promise when compared to other acoustic models trained with a similar volume of Icelandic data from different sources.
Anthology ID:
2024.lrec-main.1246
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
14305–14312
URL:
https://aclanthology.org/2024.lrec-main.1246
Cite (ACL):
Carlos Daniel Hernandez Mena, Þorsteinn Daði Gunnarsson, and Jon Gudnason. 2024. Samrómur Milljón: An ASR Corpus of One Million Verified Read Prompts in Icelandic. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14305–14312, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Samrómur Milljón: An ASR Corpus of One Million Verified Read Prompts in Icelandic (Hernandez Mena et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.1246.pdf