CEASR: A Corpus for Evaluating Automatic Speech Recognition

Malgorzata Anna Ulasik; Manuela Huerlimann; Fabian Germann; Esin Gedik; Fernando Benites; Mark Cieliebak

Abstract

In this paper, we present CEASR, a Corpus for Evaluating the quality of Automatic Speech Recognition (ASR). It is a data set based on public speech corpora, containing metadata along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata. We use CEASR to evaluate the quality of ASR systems by calculating an average Word Error Rate (WER) per corpus, per system and per corpus-system pair. Our experiments show a substantial difference in accuracy between commercial and open-source ASR tools, as well as differences of up to a factor of ten for single systems across different corpora. Using CEASR allowed us to obtain these results efficiently and easily. Our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, i.e. without the need to collect, process and transcribe the speech data themselves.
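
As a rough illustration of the evaluation described in the abstract, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, then averages it per group. The record fields (`corpus`, `system`, `reference`, `hypothesis`) and the helper names are hypothetical, and taking the mean of per-utterance WER is an assumption; the paper's own normalisation and averaging scheme may differ.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two whitespace-tokenised strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = word errors / number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_ref


def average_wer(records, key):
    """Mean per-utterance WER, grouped by key(record).

    Assumption: an unweighted mean over utterances; other averaging
    schemes (e.g. pooling errors over all words) give different numbers.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(wer(rec["reference"], rec["hypothesis"]))
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


# Hypothetical records, already normalised (lower-cased, no punctuation).
records = [
    {"corpus": "A", "system": "X", "reference": "a b", "hypothesis": "a c"},
    {"corpus": "A", "system": "Y", "reference": "a b", "hypothesis": "a b"},
]
per_corpus = average_wer(records, key=lambda r: r["corpus"])
per_system = average_wer(records, key=lambda r: r["system"])
per_pair = average_wer(records, key=lambda r: (r["corpus"], r["system"]))
```

Grouping by a key function keeps the three views named in the abstract (per corpus, per system, per corpus-system pair) behind a single helper.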

Anthology ID: 2020.lrec-1.798
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 6477–6485
Language: English
URL: https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.798/
Cite (ACL): Malgorzata Anna Ulasik, Manuela Hürlimann, Fabian Germann, Esin Gedik, Fernando Benites, and Mark Cieliebak. 2020. CEASR: A Corpus for Evaluating Automatic Speech Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6477–6485, Marseille, France. European Language Resources Association.
Cite (Informal): CEASR: A Corpus for Evaluating Automatic Speech Recognition (Ulasik et al., LREC 2020)
PDF: https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.798.pdf
