My Science Tutor (MyST)–a Large Corpus of Children’s Conversational Speech

Sameer Pradhan, Ronald A. Cole, Wayne H. Ward


Abstract
This article describes the [corpus-name] corpus developed as part of the [project-name] project. To the best of our knowledge, this is one of the largest collections of children’s conversational speech that is freely available for non-commercial use under the creative commons license (CC BY-NC-SA 4.0). It comprises approximately 400 hours of speech, spanning some 230K utterances spread across about 10,500 virtual tutor sessions. Roughly 1,300 third, fourth and fifth grade students contributed to this corpus. The current release contains roughly 100K transcribed utterances. It is our hope that the corpus can be used to improve automatic speech recognition models and algorithms. We report the word error rate achieved on the test set using a model trained on the training and development portion of the corpus. The git repository of the corpus contains the complete training and evaluation setup in order to facilitate a fair and consistent evaluation. It is our hope that this corpus will contribute to the creation and evaluation of conversational AI agents having a better understanding of children’s speech, potentially opening doors to novel, effective, learning and therapeutic interventions.
Anthology ID:
2024.lrec-main.1052
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
12040–12045
Language:
URL:
https://aclanthology.org/2024.lrec-main.1052
DOI:
Bibkey:
Cite (ACL):
Sameer Pradhan, Ronald A. Cole, and Wayne H. Ward. 2024. My Science Tutor (MyST)–a Large Corpus of Children’s Conversational Speech. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12040–12045, Torino, Italia. ELRA and ICCL.
Cite (Informal):
My Science Tutor (MyST)–a Large Corpus of Children’s Conversational Speech (Pradhan et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2024.lrec-main.1052.pdf