KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

Egor Lakomkin; Sven Magg; Cornelius Weber; Stefan Wermter

doi:10.18653/v1/D18-2016

Abstract

We describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and post-processing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, with an estimated word error rate of 3.5% in the transcriptions. The automatically collected samples contain read and spontaneous speech recorded in various conditions, including background noise and music, distant microphone recordings, reverberation, and a variety of accents. When training a deep neural network for speech recognition, we observed around 40% word error rate reduction on the Wall Street Journal dataset by integrating 200 hours of the collected samples into the training set.
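
The word error rate figures quoted in the abstract refer to a standard metric: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that general definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is a plain whitespace split,
    which is an assumption and may differ from any real scoring setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a c d")` counts one deleted word against four reference words. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.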

Anthology ID: D18-2016
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month: November
Year: 2018
Address: Brussels, Belgium
Editors: Eduardo Blanco, Wei Lu
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 90–95
URL: https://preview.aclanthology.org/fix-sig-urls/D18-2016/
DOI: 10.18653/v1/D18-2016
Cite (ACL): Egor Lakomkin, Sven Magg, Cornelius Weber, and Stefan Wermter. 2018. KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 90–95, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos (Lakomkin et al., EMNLP 2018)
PDF: https://preview.aclanthology.org/fix-sig-urls/D18-2016.pdf
Code: EgorLakomkin/KTSpeechCrawler
