2008
Children’s Oral Reading Corpus (CHOREC): Description and Assessment of Annotator Agreement
Leen Cleuren | Jacques Duchateau | Pol Ghesquière | Hugo Van hamme
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Within the scope of the SPACE project, the CHildren’s Oral REading Corpus (CHOREC) has been developed. This database contains recorded, transcribed and annotated read speech (42 GB or 130 hours) of 400 Dutch-speaking elementary school children with or without reading difficulties. Analyses of inter- and intra-annotator agreement are carried out to investigate the consistency with which reading errors are detected, orthographic and phonetic transcriptions are made, and reading errors and reading strategies are labeled. Percentage agreement scores and kappa values both show that agreement between annotations, and therefore the quality of the annotations, is high. Taking all double or triple annotations (for 10% and 30% of the corpus, respectively) together, percentage agreement varies between 86.4% and 98.6%, whereas kappa varies between 0.72 and 0.97, depending on the annotation tier being assessed. School type and reading type seem to account for systematic differences in percentage agreement, but these differences disappear when kappa values, which correct for chance agreement, are calculated. Finally, the annotation differences are analyzed with respect to the *s label (i.e. a label used to annotate indistinguishable spelling behaviour), phoneme labels, and reading strategy and error labels.
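The difference between percentage agreement and a chance-corrected kappa can be illustrated with a minimal sketch. It assumes Cohen's kappa for two annotators, which the abstract does not specify; the function names and the `ok`/`err` labels are hypothetical and are not taken from CHOREC.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Assumption: two annotators and nominal labels. The abstract does not
    say which kappa variant was used for the corpus.
    """
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently at their own marginal rates.
    p_chance = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_chance == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical word-level labels: "ok" = read correctly, "err" = reading error.
annotator_1 = ["ok"] * 9 + ["err"]
annotator_2 = ["ok"] * 8 + ["err", "err"]

print(percent_agreement(annotator_1, annotator_2))
print(cohen_kappa(annotator_1, annotator_2))
```

When one label dominates, as with mostly correctly read words, raw agreement is high partly by chance, so kappa comes out noticeably lower than percentage agreement for the same pair of annotations.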
2004
Use and Evaluation of Prosodic Annotations in Dutch
Jacques Duchateau | Tim Ceyssens | Hugo Van hamme
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
In the development of annotations for a spoken database, an important issue is whether the annotations can be generated automatically with sufficient precision, or whether expensive manual annotations are needed. This paper discusses the case of prosodic annotations, which was investigated on the CGN database (Spoken Dutch Corpus). The main conclusions of this work are as follows. First, the available amount of manual prosodic annotations is sufficient for the development of our (baseline, decision-tree-based) prosodic models. In other words, more manual annotations do not improve the models. Second, the developed prosodic models for prominence are not accurate enough to produce automatic prominence annotations that are as good as the manual ones. On the other hand, the consistency between manual and automatic break annotations is as high as the inter-transcriber consistency for breaks. So, given the current amount of manual break annotations, annotations for the remainder of the CGN database can be generated automatically with the same quality as the manual annotations.
2002
An Improved Algorithm for the Automatic Segmentation of Speech Corpora
Tom Laureys | Kris Demuynck | Jacques Duchateau | Patrick Wambacq
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
Word Segmentation in the Spoken Dutch Corpus
Jean-Pierre Martens | Diana Binnenpoorte | Kris Demuynck | Ruben Van Parys | Tom Laureys | Wim Goedertier | Jacques Duchateau
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)