Guy De Pauw

2023

pdf abs
Advancing Topical Text Classification: A Novel Distance-Based Method with Contextual Embeddings
Andriy Kosar | Guy De Pauw | Walter Daelemans
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This study introduces a new method for distance-based unsupervised topical text classification using contextual embeddings. The method applies and tailors sentence embeddings for distance-based topical text classification. This is achieved by leveraging the semantic similarity between topic labels and text content, and reinforcing the relationship between them in a shared semantic space. The proposed method outperforms a wide range of existing sentence embeddings on average by 35%. Presenting an alternative to the commonly used transformer-based zero-shot general-purpose classifiers for multiclass text classification, the method demonstrates significant advantages in terms of computational efficiency and flexibility, while maintaining comparable or improved classification results.

2015

2013

2012

Although in recent years numerous forms of Internet communication ― such as e-mail, blogs, chat rooms and social network environments ― have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required for e.g. automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.

pdf
Towards a Self-Learning Assistive Vocal Interface: Vocabulary and Grammar Learning
Janneke van de Loo | Jort F. Gemmeke | Guy De Pauw | Joris Driesen | Hugo Van hamme | Walter Daelemans
Proceedings of the 1st Workshop on Speech and Multimodal Interaction in Assistive Environments

2009

pdf bib
Proceedings of the First Workshop on Language Technologies for African Languages
Lori Levin | John Kiango | Judith Klavans | Guy De Pauw | Gilles-Maurice de Schryver | Peter Waiganjo Wagacha
Proceedings of the First Workshop on Language Technologies for African Languages

pdf bib
The SAWA Corpus: A Parallel Corpus English - Swahili
Guy De Pauw | Peter Waiganjo Wagacha | Gilles-Maurice de Schryver
Proceedings of the First Workshop on Language Technologies for African Languages

2006

pdf abs
A Grapheme-Based Approach for Accent Restoration in Gikuyu
Peter W. Wagacha | Guy De Pauw | Pauline W. Githinji
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The orthography of Gikuyu includes a number of accented characters to represent the entire vowel system. These characters are however not readily available on standard computer keyboards and are usually represented as the nearest available character. This can render reading and understanding written texts more difficult. This paper describes a system that is able to automatically place these accents in Gikuyu text on the basis of local graphemic context. This approach avoids the need for an extensive digital lexicon, typically not available for resource-scarce languages. Using an extended trigram based-approach, the experiments show that this method can achieve a very high accuracy even with a limited amount of digitally available textual data. The experiments on Gikuyu are contrasted with experiments on French, German and Dutch.

pdf abs
KNACK-2002: a Richly Annotated Corpus of Dutch Written Text
Véronique Hoste | Guy De Pauw
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we introduce the annotated KNACK-2002 corpus of Dutch written text. The corpus features five different annotation layers, ranging from the annotation of morphological boundaries at the word level, over the annotation of part-of-speech tags and phrase chunks at the syntactic level to the annotation of named entities at the semantic level and coreferential relations at the discourse level. We believe the corpus is unique in the Dutch language area because of its richness of annotation layers, providing researchers with a useful gold standard data set for different NLP tasks in the domains of morphology, (morpho)syntax, semantics and discourse.

pdf abs
A mixed word / morphological approach for extending CELEX for high coverage on contemporary large corpora
Joris Vaneyghen | Guy De Pauw | Dirk Van Compernolle | Walter Daelemans
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes an alternative approach to morphological language modeling, which incorporates constraints on the morphological production of new words. This is done by applying the constraints as a preprocessing step in which only one morphological production rule can be applied to an extended lexicon of knownmorphemes, lemmas and word forms. This approach is used to extend the CELEX Dutch morphological database, so that a higher coverage can be reached on a largecorpus of Dutch newspaper articles. We present experimental results on the coverage of this extended database and use the extension to further evaluate our morphologicalsystem, as well as the impact of the constraints on the coverage of out-of-vocabulary words.

2004

pdf
A Comparison of Two Different Approaches to Morphological Analysis of Dutch
Guy De Pauw | Tom Laureys | Walter Daelemans | Hugo Van hamme
Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology

pdf abs
Evaluation and Adaptation of the Celex Dutch Morphological Database
Tom Laureys | Guy De Pauw | Hugo Van hamme | Walter Daelemans | Dirk Van Compernolle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper describes some important modifications to the Celex morphological database in the context of the FLaVoR project. FLaVoR aims to develop a novel modular framework for speech recognition, enabling the integration of complex linguistic knowledge sources, such as a morphological model. Morphology is a fairly unexploited linguistic information source speech recognizers could benefit from. This is especially true for languages which allow for a rich set of morphological operations, such as our target language Dutch. In this paper we focus on the exploitation of the Celex Dutch morphological database as the information source underlying two different morphological analyzers being developed within the project. Although the Celex database provides a valuable source of morphological information for Dutch, many modifications were necessary before it could be practically applied. We identify major problems, discuss the implemented solutions and finally experimentally evaluate the effect of our modifications to the database.