RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition

Alexandru-Lucian Georgescu; Horia Cucu; Andi Buzo; Corneliu Burileanu

Abstract

Although many efforts have been made in the last decade to enhance the speech and language resources for Romanian, the language is still considered under-resourced. While large speech corpora are available for research and commercial applications in many other languages, the largest publicly available corpus for Romanian to date comprises less than 50 hours of speech. In this context, the Speech and Dialogue research group releases the Read Speech Corpus (RSC) – a Romanian speech corpus developed in-house, comprising 100 hours of speech recordings from 164 different speakers. The paper describes the development of the corpus and presents baseline automatic speech recognition (ASR) results obtained with state-of-the-art ASR technology: the Kaldi speech recognition toolkit.

Anthology ID: 2020.lrec-1.814
Volume: Proceedings of the 12th Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 6606–6612
Language: English
URL: https://aclanthology.org/2020.lrec-1.814
Cite (ACL): Alexandru-Lucian Georgescu, Horia Cucu, Andi Buzo, and Corneliu Burileanu. 2020. RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6606–6612, Marseille, France. European Language Resources Association.
Cite (Informal): RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition (Georgescu et al., LREC 2020)
PDF: https://preview.aclanthology.org/update-css-js/2020.lrec-1.814.pdf