Dirk Van Compernolle


A mixed word / morphological approach for extending CELEX for high coverage on contemporary large corpora
Joris Vaneyghen | Guy De Pauw | Dirk Van Compernolle | Walter Daelemans
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes an alternative approach to morphological language modeling, which incorporates constraints on the morphological production of new words. This is done by applying the constraints as a preprocessing step in which only one morphological production rule can be applied to an extended lexicon of known morphemes, lemmas and word forms. This approach is used to extend the CELEX Dutch morphological database, so that a higher coverage can be reached on a large corpus of Dutch newspaper articles. We present experimental results on the coverage of this extended database and use the extension to further evaluate our morphological system, as well as the impact of the constraints on the coverage of out-of-vocabulary words.


Evaluation and Adaptation of the Celex Dutch Morphological Database
Tom Laureys | Guy De Pauw | Hugo Van hamme | Walter Daelemans | Dirk Van Compernolle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes some important modifications to the Celex morphological database in the context of the FLaVoR project. FLaVoR aims to develop a novel modular framework for speech recognition, enabling the integration of complex linguistic knowledge sources, such as a morphological model. Morphology is a largely unexploited source of linguistic information that speech recognizers could benefit from. This is especially true for languages that allow for a rich set of morphological operations, such as our target language Dutch. In this paper we focus on the exploitation of the Celex Dutch morphological database as the information source underlying two different morphological analyzers being developed within the project. Although the Celex database provides a valuable source of morphological information for Dutch, many modifications were necessary before it could be practically applied. We identify major problems, discuss the implemented solutions and finally experimentally evaluate the effect of our modifications to the database.

Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Kris Demuynck | Tom Laureys | Patrick Wambacq | Dirk Van Compernolle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The CGN corpus (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.


A Structured Language Model Based on Context-Sensitive Probabilistic Left-Corner Parsing
Dong Hoon Van Uytsel | Filip Van Aelten | Dirk Van Compernolle
Second Meeting of the North American Chapter of the Association for Computational Linguistics