Abstract
Finding informative COVID-19 posts in a stream of tweets is very useful to monitor health-related updates. Prior work focused on a balanced data setup and on English, but informative tweets are rare, and English is only one of the many languages spoken in the world. In this work, we introduce a new dataset of 5,000 tweets for finding informative COVID-19 tweets for Danish. In contrast to prior work, which balances the label distribution, we model the problem by keeping its natural distribution. We examine how well a simple probabilistic model and a convolutional neural network (CNN) perform on this task. We find a weighted CNN to work well but it is sensitive to embedding and hyperparameter choices. We hope the contributed dataset is a starting point for further work in this direction.

- Anthology ID:
- 2021.wnut-1.2
- Volume:
- Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
- Month:
- November
- Year:
- 2021
- Address:
- Online
- Venue:
- WNUT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 11–19
- URL:
- https://aclanthology.org/2021.wnut-1.2
- DOI:
- 10.18653/v1/2021.wnut-1.2
- Cite (ACL):
- Benjamin Olsen and Barbara Plank. 2021. Finding the needle in a haystack: Extraction of Informative COVID-19 Danish Tweets. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 11–19, Online. Association for Computational Linguistics.
- Cite (Informal):
- Finding the needle in a haystack: Extraction of Informative COVID-19 Danish Tweets (Olsen & Plank, WNUT 2021)
- PDF:
- https://preview.aclanthology.org/auto-file-uploads/2021.wnut-1.2.pdf