Marijn Huijbregts


2014

Semi-automatic annotation of the UCU accents speech corpus
Rosemary Orr | Marijn Huijbregts | Roeland van Beek | Lisa Teunissen | Kate Backhouse | David van Leeuwen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Annotation and labeling of speech tasks in large multitask speech corpora is a necessary part of preparing a corpus for distribution. We address three approaches to annotation and labeling: manual, semi-automatic and automatic procedures for labeling the UCU Accent Project speech data, a multilingual multitask longitudinal speech corpus. Accuracy and minimal time investment are the priorities in assessing the efficacy of each procedure. While manual labeling based on aural and visual input should produce the most accurate results, this approach is error-prone because of its repetitive nature. A semi-automatic event detection system requiring manual rejection of false alarms and location and labeling of misses provided the best results. A fully automatic system could not be applied to entire speech recordings because of the variety of tasks and genres. However, it could be used to annotate separate sentences within a specific task. Acoustic confidence measures can correctly detect sentences that do not match the text with an EER of 3.3%.

2008

Evaluation of Spoken Document Retrieval for Historic Speech Collections
Willemijn Heeren | Franciska de Jong | Laurens van der Werff | Marijn Huijbregts | Roeland Ordelman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The re-use of spoken word audio collections maintained by audiovisual archives is severely hindered by their generally limited access. The CHoral project, which is part of the CATCH program funded by the Dutch Research Council, aims to provide users of speech archives with online, instead of on-location, access to relevant fragments, instead of full documents. To meet this goal, a spoken document retrieval framework is being developed. In this paper the evaluation efforts undertaken so far to assess and improve various aspects of the framework are presented. These efforts include (i) evaluation of the automatically generated textual representations of the spoken word documents that enable word-based search, (ii) the development of measures to estimate the quality of the textual representations for use in information retrieval, and (iii) studies to establish the potential user groups of the to-be-developed technology, and the first versions of the user interface supporting online access to spoken word collections.