Samrómur Children: An Icelandic Speech Corpus

Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský, Jón Guðnason


Abstract
Samrómur Children is an Icelandic speech corpus intended for automatic speech recognition. It contains 131 hours of read speech from Icelandic children aged 4 to 17 years. The test portion was meticulously selected to cover as wide a range of ages as possible; we aimed to have exactly the same amount of data per age range. The speech was collected with the crowd-sourcing platform Samrómur.is, which is inspired by Mozilla’s Common Voice project. The corpus was developed within the framework of the “Language Technology Programme for Icelandic 2019–2023”; the goal of the programme is to make Icelandic available in language-technology applications. Samrómur Children is the first Icelandic corpus of children’s voices released for public use under a Creative Commons license. Additionally, we present baseline experiments and results using Kaldi.
Anthology ID:
2022.lrec-1.105
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
995–1002
URL:
https://aclanthology.org/2022.lrec-1.105
Cite (ACL):
Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský, and Jón Guðnason. 2022. Samrómur Children: An Icelandic Speech Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 995–1002, Marseille, France. European Language Resources Association.
Cite (Informal):
Samrómur Children: An Icelandic Speech Corpus (Hernandez Mena et al., LREC 2022)
PDF:
https://preview.aclanthology.org/ingest-acl-2023-videos/2022.lrec-1.105.pdf
Data:
Common Voice