Doina Caragea


2021

Multi-task Learning to Enable Location Mention Identification in the Early Hours of a Crisis Event
Sarthak Khanal | Doina Caragea
Findings of the Association for Computational Linguistics: EMNLP 2021

Training a robust and reliable deep learning model requires a large amount of data. In the crisis domain, building deep learning models to identify actionable information from the huge influx of data posted by eyewitnesses of crisis events on social media, in a time-critical manner, is central to fast response and relief operations. However, building a large, annotated dataset to train deep learning models is not always feasible in a crisis situation. In this paper, we investigate a multi-task learning approach to concurrently leverage available annotated data for several related tasks from the crisis domain to improve the performance on a main task with limited annotated data. Specifically, we focus on using multi-task learning to improve the performance on the task of identifying location mentions in crisis tweets.

Stance Detection in COVID-19 Tweets
Kyle Glandt | Sarthak Khanal | Yingjie Li | Doina Caragea | Cornelia Caragea
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The prevalence of the COVID-19 pandemic in day-to-day life has yielded large amounts of stance detection data on social media sites, as users turn to social media to share their views regarding various issues related to the pandemic, e.g., stay-at-home mandates and wearing face masks when out in public. We set out to make use of this data by collecting the stance expressed by Twitter users, with respect to topics revolving around the pandemic. We annotate a new stance detection dataset, called COVID-19-Stance. Using this newly annotated dataset, we train several established stance detection models to ascertain a baseline performance for this specific task. To further improve the performance, we employ self-training and domain adaptation approaches to take advantage of large amounts of unlabeled data and existing stance detection datasets. The dataset, code, and other resources are available on GitHub.

2020

Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup
Jishnu Ray Chowdhury | Cornelia Caragea | Doina Caragea
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Distinguishing informative and actionable messages from a social media platform like Twitter is critical for facilitating disaster management. For this purpose, we compile a multilingual dataset of over 130K samples for multi-label classification of disaster-related tweets. We present a masking-based loss function for partially labelled samples and demonstrate the effectiveness of Manifold Mixup in the text domain. Our main model is based on Multilingual BERT, which we further improve with Manifold Mixup. We show that our model generalizes to unseen disasters in the test set. Furthermore, we analyze the capability of our model for zero-shot generalization to new languages. Our code, dataset, and other resources are available on GitHub.

2010

KSU KDD: Word Sense Induction by Clustering in Topic Space
Wesam Elshamy | Doina Caragea | William Hsu
Proceedings of the 5th International Workshop on Semantic Evaluation