Dietwig Lowet


2022

pdf
An exploratory data analysis: the performance differences of a medical code prediction system on different demographic groups
Heereen Shim | Dietwig Lowet | Stijn Luca | Bart Vanrumste
Proceedings of the 4th Clinical Natural Language Processing Workshop

Recent studies show that neural natural processing models for medical code prediction suffer from a label imbalance issue. This study aims to investigate further imbalance in a medical code prediction dataset in terms of demographic variables and analyse performance differences in demographic groups. We use sample-based metrics to correctly evaluate the performance in terms of the data subject. Also, a simple label distance metric is proposed to quantify the difference in the label distribution between a group and the entire data. Our analysis results reveal that the model performs differently towards different demographic groups: significant differences between age groups and between insurance types are observed. Interestingly, we found a weak positive correlation between the number of training data of the group and the performance of the group. However, a strong negative correlation between the label distance of the group and the performance of the group is observed. This result suggests that the model tends to perform poorly in the group whose label distribution is different from the global label distribution of the training data set. Further analysis of the model performance is required to identify the cause of these differences and to improve the model building.

2021

pdf
Synthetic Data Generation and Multi-Task Learning for Extracting Temporal Information from Health-Related Narrative Text
Heereen Shim | Dietwig Lowet | Stijn Luca | Bart Vanrumste
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Extracting temporal information is critical to process health-related text. Temporal information extraction is a challenging task for language models because it requires processing both texts and numbers. Moreover, the fundamental challenge is how to obtain a large-scale training dataset. To address this, we propose a synthetic data generation algorithm. Also, we propose a novel multi-task temporal information extraction model and investigate whether multi-task learning can contribute to performance improvement by exploiting additional training signals with the existing training data. For experiments, we collected a custom dataset containing unstructured texts with temporal information of sleep-related activities. Experimental results show that utilising synthetic data can improve the performance when the augmentation factor is 3. The results also show that when multi-task learning is used with an appropriate amount of synthetic data, the performance can significantly improve from 82. to 88.6 and from 83.9 to 91.9 regarding micro-and macro-average exact match scores of normalised time prediction, respectively.

pdf
Building blocks of a task-oriented dialogue system in the healthcare domain
Heereen Shim | Dietwig Lowet | Stijn Luca | Bart Vanrumste
Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations

There has been significant progress in dialogue systems research. However, dialogue systems research in the healthcare domain is still in its infancy. In this paper, we analyse recent studies and outline three building blocks of a task-oriented dialogue system in the healthcare domain: i) privacy-preserving data collection; ii) medical knowledge-grounded dialogue management; and iii) human-centric evaluations. To this end, we propose a framework for developing a dialogue system and show preliminary results of simulated dialogue data generation by utilising expert knowledge and crowd-sourcing.