Samrómur Children: An Icelandic Speech Corpus
Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský, Jón Guðnason
Abstract
Samrómur Children is an Icelandic speech corpus intended for the field of automatic speech recognition. It contains 131 hours of read speech from Icelandic children aged 4 to 17 years. The test portion was meticulously selected to cover as wide a range of ages as possible; we aimed to have exactly the same amount of data per age range. The speech was collected with the crowd-sourcing platform Samrómur.is, which is inspired by Mozilla’s Common Voice project. The corpus was developed within the framework of the “Language Technology Programme for Icelandic 2019–2023”; the goal of the project is to make Icelandic available in language-technology applications. Samrómur Children is the first Icelandic corpus of children’s voices released for public use under a Creative Commons license. Additionally, we present baseline experiments and results using Kaldi.
- Anthology ID:
- 2022.lrec-1.105
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 995–1002
- URL:
- https://aclanthology.org/2022.lrec-1.105
- Cite (ACL):
- Carlos Daniel Hernandez Mena, David Erik Mollberg, Michal Borský, and Jón Guðnason. 2022. Samrómur Children: An Icelandic Speech Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 995–1002, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Samrómur Children: An Icelandic Speech Corpus (Hernandez Mena et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.lrec-1.105.pdf
- Data
- Common Voice