Wessel Kraaij


FuzzyBIO: A Proposal for Fuzzy Representation of Discontinuous Entities
Anne Dirkson | Suzan Verberne | Wessel Kraaij
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Discontinuous entities pose a challenge to named entity recognition (NER). Such entities occur commonly in the biomedical domain. As a solution, expansions of the BIO representation scheme that can handle these entity types are commonly used (e.g. BIOHD). However, the extra tag types make the NER task more difficult to learn. In this paper we propose an alternative: a fuzzy continuous BIO scheme (FuzzyBIO). We focus on the task of Adverse Drug Response extraction and normalization to compare FuzzyBIO to BIOHD. We find that FuzzyBIO improves recall of NER for two of three data sets and results in a higher percentage of correctly identified disjoint and composite entities for all data sets. Using FuzzyBIO also improves end-to-end performance for continuous and composite entities in two of three data sets. Since FuzzyBIO improves performance for some data sets and the conversion from BIOHD to FuzzyBIO is straightforward, we recommend investigating which is more effective for any data set containing discontinuous entities.


Conversation-Aware Filtering of Online Patient Forum Messages
Anne Dirkson | Suzan Verberne | Wessel Kraaij
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

Previous approaches to NLP tasks on online patient forums have been limited to single posts as units, thereby neglecting the overarching conversational structure. In this paper we explore the benefit of exploiting conversational context for filtering posts relevant to a specific medical topic. We experiment with two approaches to add conversational context to a BERT model: a sequential CRF layer and manually engineered features. Although neither approach can outperform the F1 score of the BERT baseline, we find that adding a sequential layer improves precision for all target classes whereas adding a non-sequential layer with manually engineered features leads to a higher recall for two out of three target classes. Thus, depending on the end goal, conversation-aware modelling may be beneficial for identifying relevant messages. We hope our findings encourage other researchers in this domain to move beyond studying messages in isolation towards more discourse-based data collection and classification. We release our code for the purpose of follow-up research.


Lexical Normalization of User-Generated Medical Text
Anne Dirkson | Suzan Verberne | Wessel Kraaij
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

In the medical domain, user-generated social media text is increasingly used as a valuable complementary knowledge source to scientific medical literature. The extraction of this knowledge is complicated by colloquial language use and misspellings. Yet, lexical normalization of such data has not been addressed properly. This paper presents an unsupervised, data-driven spelling correction module for medical social media. Our method outperforms state-of-the-art spelling correction and can detect mistakes with an F0.5 of 0.888. Additionally, we present a novel corpus for spelling mistake detection and correction on a medical patient forum.


Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
Wessel Kraaij | Jian-Yun Nie | Michel Simard
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus


Ambiguity resolution and the retrieval of idioms: two approaches
Erik-Jan van der Linden | Wessel Kraaij
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics