Kris Demuynck


BioLORD: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions
François Remy | Kris Demuynck | Thomas Demeester
Findings of the Association for Computational Linguistics: EMNLP 2022

This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).


Explaining Character-Aware Neural Networks for Word-Level Prediction: Do They Discover Linguistic Rules?
Fréderic Godin | Kris Demuynck | Joni Dambre | Wesley De Neve | Thomas Demeester
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Character-level features are currently used in different neural network-based natural language processing algorithms. However, little is known about the character-level patterns those models learn. Moreover, models are often compared only quantitatively while a qualitative analysis is missing. In this paper, we investigate which character-level patterns neural networks learn and if those patterns coincide with manually-defined word segmentations and annotations. To that end, we extend the contextual decomposition technique (Murdoch et al. 2018) to convolutional neural networks which allows us to compare convolutional neural networks and bidirectional long short-term memory networks. We evaluate and compare these models for the task of morphological tagging on three morphologically different languages and show that these models implicitly discover understandable linguistic rules.


SCALE: A Scalable Language Engineering Toolkit
Joris Pelemans | Lyan Verwimp | Kris Demuynck | Hugo Van hamme | Patrick Wambacq
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present SCALE, a new Python toolkit that contains two extensions to n-gram language models. The first extension is a novel technique to model compound words called Semantic Head Mapping (SHM). The second extension, Bag-of-Words Language Modeling (BagLM), bundles popular models such as Latent Semantic Analysis and Continuous Skip-grams. Both extensions scale to large data and allow the integration into first-pass ASR decoding. The toolkit is open source, includes working examples and can be found on


Speech Recognition Web Services for Dutch
Joris Pelemans | Kris Demuynck | Hugo Van hamme | Patrick Wambacq
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present 3 applications in the domain of Automatic Speech Recognition for Dutch, all of which are developed using our in-house speech recognition toolkit SPRAAK. The speech-to-text transcriber is a large vocabulary continuous speech recognizer, optimized for Southern Dutch. It is capable to select components and adjust parameters on the fly, based on the observed conditions in the audio and was recently extended with the capability of adding new words to the lexicon. The grapheme-to-phoneme converter generates possible pronunciations for Dutch words, based on lexicon lookup and linguistic rules. The speech-text alignment system takes audio and text as input and constructs a time aligned output where every word receives exact begin and end times. All three of the applications (and others) are freely available, after registration, as a web application on and in addition, can be accessed as a web service in automated tools.


A Speech Corpus for Modeling Language Acquisition: CAREGIVER
Toomas Altosaar | Louis ten Bosch | Guillaume Aimetti | Christos Koniaris | Kris Demuynck | Henk van den Heuvel
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

A multi-lingual speech corpus used for modeling language acquisition called CAREGIVER has been designed and recorded within the framework of the EU funded Acquisition of Communication and Recognition Skills (ACORNS) project. The paper describes the motivation behind the corpus and its design by relying on current knowledge regarding infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured in both infant-directed and adult-directed speech modes over four languages in a read speech manner. The challenges and methods applied to obtain similar prompts in terms of complexity and semantics across different languages, as well as the normalized recording procedures employed at different locations, is covered. The corpus contains nearly 66000 utterance based audio files spoken over a two-year period by 17 male and 17 female native speakers of Dutch, English, Finnish, and Swedish. An orthographical transcription is available for every utterance. Also, time-aligned word and phone annotations for many of the sub-corpora also exist. The CAREGIVER corpus will be published via ELRA.


Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Kris Demuynck | Tom Laureys | Patrick Wambacq | Dirk Van Compernolle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The CGN corpus (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select to most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.


An Improved Algorithm for the Automatic Segmentation of Speech Corpora
Tom Laureys | Kris Demuynck | Jacques Duchateau | Patrick Wambacq
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Word Segmentation in the Spoken Dutch Corpus
Jean-Pierre Martens | Diana Binnenpoorte | Kris Demuynck | Ruben Van Parys | Tom Laureys | Wim Goedertier | Jacques Duchateau
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)