An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation

Peter Viszlay, Ján Staš, Tomáš Koctúr, Martin Lojka, Jozef Juhár


Abstract
In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. The scheme first relies on automatic transcription of the broadcast news (BN) data by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired a further 900 hours of BN data, part of which we plan to annotate semi-automatically. We present a preliminary corpus evaluation that gives very promising results.
Anthology ID:
L16-1743
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4684–4687
URL:
https://aclanthology.org/L16-1743
Cite (ACL):
Peter Viszlay, Ján Staš, Tomáš Koctúr, Martin Lojka, and Jozef Juhár. 2016. An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4684–4687, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation (Viszlay et al., LREC 2016)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/L16-1743.pdf