Kim Luyckx

2013

pdf
De-Identification of Clinical Free Text in Dutch with Limited Training Data: A Case Study
Elyne Scheurwegs | Kim Luyckx | Filip Van der Schueren | Tim Van den Bulcke
Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP 2013

2012

Although in recent years numerous forms of Internet communication ― such as e-mail, blogs, chat rooms and social network environments ― have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required for e.g. automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.

2008

pdf abs
Personae: a Corpus for Author and Personality Prediction from Text
Kim Luyckx | Walter Daelemans
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a new corpus for computational stylometry, more specifically authorship attribution and the prediction of author personality from text. Because of the large number of authors (145), the corpus will allow previously impossible studies of variation in features considered predictive for writing style. The innovative meta-information (personality profiles of the authors) associated with these texts allows the study of personality prediction, a not yet very well researched aspect of style. In this paper, we describe the contents of the corpus and show its use in both authorship attribution and personality prediction. We focus on features that have been proven useful in the field of author recognition. Syntactic features like part-of-speech n-grams are generally accepted as not being under the authors conscious control and therefore providing good clues for predicting gender or authorship. We want to test whether these features are helpful for personality prediction and authorship attribution on a large set of authors. Both tasks are approached as text categorization tasks. First a document representation is constructed based on feature selection from the linguistically analyzed corpus (using the Memory-Based Shallow Parser (MBSP)). These are associated with each of the 145 authors or each of the four components of the Myers-Briggs Type Indicator (Introverted-Extraverted, Sensing-iNtuitive, Thinking-Feeling, Judging-Perceiving). Authorship attribution on 145 authors achieves results around 50%-accuracy. Preliminary results indicate that the first two personality dimensions can be predicted fairly accurately.

pdf
Authorship Attribution and Verification with Many Authors and Limited Data
Kim Luyckx | Walter Daelemans
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
CNTS: Memory-Based Learning of Generating Repeated References
Iris Hendrickx | Walter Daelemans | Kim Luyckx | Roser Morante | Vincent Van Asch
Proceedings of the Fifth International Natural Language Generation Conference