David van Leeuwen


2016

A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-Switching Research
Emre Yilmaz | Maaike Andringa | Sigrid Kingma | Jelske Dijkstra | Frits van der Kuip | Hans Van de Velde | Frederik Kampstra | Jouke Algra | Henk van den Heuvel | David van Leeuwen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new speech database containing 18.5 hours of annotated radio broadcasts in the Frisian language. Frisian is mostly spoken in the province of Fryslân and is the second official language of the Netherlands. The recordings were collected from the archives of Omrop Fryslân, the regional public broadcaster of the province. The database spans almost 50 years. Native speakers of Frisian are mostly bilingual and often code-switch in daily conversation due to the extensive influence of the Dutch language. Given the longitudinal and code-switching nature of the data, an appropriate annotation protocol has been designed, and the data have been manually annotated with orthographic transcriptions, speaker identities, dialect information, code-switching details and background noise/music information.

2014

Semi-automatic annotation of the UCU accents speech corpus
Rosemary Orr | Marijn Huijbregts | Roeland van Beek | Lisa Teunissen | Kate Backhouse | David van Leeuwen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Annotation and labeling of speech tasks in large multitask speech corpora is a necessary part of preparing a corpus for distribution. We address three approaches to annotation and labeling: manual, semi-automatic and automatic procedures for labeling the UCU Accent Project speech data, a multilingual, multitask, longitudinal speech corpus. Accuracy and minimal time investment are the priorities in assessing the efficacy of each procedure. While manual labeling based on aural and visual input should produce the most accurate results, this approach is error-prone because of its repetitive nature. A semi-automatic event detection system, requiring manual rejection of false alarms and manual location and labeling of misses, provided the best results. A fully automatic system could not be applied to entire speech recordings because of the variety of tasks and genres. However, it could be used to annotate separate sentences within a specific task. Acoustic confidence measures can correctly detect sentences that do not match the text with an EER of 3.3%.
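
The equal error rate (EER) quoted in this abstract is the operating point at which the rate of missed mismatches equals the rate of false alarms. The following is a minimal sketch of how such a figure can be estimated from two lists of confidence scores; it is not the authors' implementation, and the function name, argument names and toy scores are all hypothetical. It assumes that a higher score means the audio is more likely to match the text.

```python
def equal_error_rate(match_scores, mismatch_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption (not from the paper): a sentence is flagged as a mismatch
    when its confidence score falls below the threshold.
    """
    thresholds = sorted(set(match_scores) | set(mismatch_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a mismatching sentence scores at or above t and is not flagged.
        miss = sum(s >= t for s in mismatch_scores) / len(mismatch_scores)
        # False alarm: a matching sentence scores below t and is flagged.
        false_alarm = sum(s < t for s in match_scores) / len(match_scores)
        gap = abs(miss - false_alarm)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Toy scores, invented for illustration only.
toy_match = [0.9, 0.8, 0.7, 0.6]
toy_mismatch = [0.1, 0.2, 0.3, 0.65]
print(equal_error_rate(toy_match, toy_mismatch))  # 0.25
```

With real data the two error rates rarely cross exactly at an observed score, so evaluation toolkits usually interpolate between neighbouring thresholds; the sweep above simply returns the average of the two rates at the closest point.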