Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Muhammad Abdul-Mageed; AbdelRahim Elmadany; El Moatez Billah Nagoudi; Dinesh Pabbi; Kunal Verma; Rannie Lin

doi:10.18653/v1/2021.eacl-main.298

Abstract

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.

Anthology ID: 2021.eacl-main.298
Volume: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month: April
Year: 2021
Address: Online
Editors: Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 3402–3420
URL: https://aclanthology.org/2021.eacl-main.298
DOI: 10.18653/v1/2021.eacl-main.298
Cite (ACL): Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, and Rannie Lin. 2021. Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3402–3420, Online. Association for Computational Linguistics.
Cite (Informal): Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19 (Abdul-Mageed et al., EACL 2021)
PDF: https://preview.aclanthology.org/emnlp22-frontmatter/2021.eacl-main.298.pdf
Code: UBC-NLP/megacov
Data: Mega-COV
