Samrómur: Crowd-sourcing large amounts of data

Staffan Hedström, David Erik Mollberg, Ragnheiður Þórhallsdóttir, Jón Guðnason


Abstract
This contribution describes the collection of a large and diverse corpus for speech recognition and similar tools using crowd-sourced donations. We have built a collection platform inspired by Mozilla Common Voice and specialized it to our needs. We discuss the importance of engaging the community and motivating it to contribute, in our case through competitions. Given the incentive and a platform on which large numbers of utterances can easily be recorded, we have observed four cases of speakers freely donating over 10 thousand utterances. We have also seen that women are keener to participate in these events across all age groups. Manually verifying a large corpus is a monumental task, so we attempt to verify parts of the data automatically using tools such as Marosijo and the Montreal Forced Aligner. The method proved helpful, especially for detecting invalid utterances and for halving the work needed from crowd-sourced verification.
Anthology ID:
2022.lrec-1.247
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2311–2316
URL:
https://aclanthology.org/2022.lrec-1.247
Cite (ACL):
Staffan Hedström, David Erik Mollberg, Ragnheiður Þórhallsdóttir, and Jón Guðnason. 2022. Samrómur: Crowd-sourcing large amounts of data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2311–2316, Marseille, France. European Language Resources Association.
Cite (Informal):
Samrómur: Crowd-sourcing large amounts of data (Hedström et al., LREC 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.lrec-1.247.pdf