Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Kris Demuynck, Tom Laureys, Patrick Wambacq, Dirk Van Compernolle
Abstract
The CGN corpus (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.
- Anthology ID: L04-1264
- Volume: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
- Month: May
- Year: 2004
- Address: Lisbon, Portugal
- Editors: Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, Raquel Silva
- Venue: LREC
- Publisher: European Language Resources Association (ELRA)
- URL: http://www.lrec-conf.org/proceedings/lrec2004/pdf/447.pdf
- Cite (ACL): Kris Demuynck, Tom Laureys, Patrick Wambacq, and Dirk Van Compernolle. 2004. Automatic Phonemic Labeling and Segmentation of Spoken Dutch. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).
- Cite (Informal): Automatic Phonemic Labeling and Segmentation of Spoken Dutch (Demuynck et al., LREC 2004)