Khan Academy Corpus: A Multilingual Corpus of Khan Academy Lectures

Dominika Ďurišková, Daniela Jurášová, Matúš Žilinec, Eduard Šubert, Ondřej Bojar


Abstract
We present the Khan Academy Corpus, totalling 10122 hours in 87394 recordings across 29 languages, where 43% of the recordings (4252 hours) are equipped with human-written subtitles. The subtitle texts cover a total of 137 languages. The dataset was collected from open-access Khan Academy lectures, benefiting from their manual transcripts and the manual translations of those transcripts. The dataset can serve in the creation or evaluation of multilingual speech recognition or translation systems, and it features a diverse set of subject domains.
Anthology ID:
2024.lrec-main.851
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
9743–9752
URL:
https://aclanthology.org/2024.lrec-main.851
Cite (ACL):
Dominika Ďurišková, Daniela Jurášová, Matúš Žilinec, Eduard Šubert, and Ondřej Bojar. 2024. Khan Academy Corpus: A Multilingual Corpus of Khan Academy Lectures. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9743–9752, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Khan Academy Corpus: A Multilingual Corpus of Khan Academy Lectures (Ďurišková et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2024.lrec-main.851.pdf