Alexandru-Lucian Georgescu
2020
RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition
Alexandru-Lucian Georgescu | Horia Cucu | Andi Buzo | Corneliu Burileanu
Proceedings of the Twelfth Language Resources and Evaluation Conference
Although many efforts have been made in the last decade to enhance the speech and language resources for Romanian, the language is still considered under-resourced. While large speech corpora are available for research and commercial applications in many other languages, the largest publicly available Romanian corpus to date comprises less than 50 hours of speech. In this context, the Speech and Dialogue research group releases the Read Speech Corpus (RSC), a Romanian speech corpus developed in-house, comprising 100 hours of speech recordings from 164 different speakers. The paper describes the development of the corpus and presents baseline automatic speech recognition (ASR) results obtained with state-of-the-art ASR technology: the Kaldi speech recognition toolkit.