Multimodal Pipeline for Collection of Misinformation Data from Telegram

Jose Sosa, Serge Sharoff


Abstract
This paper presents the outcomes of AI-COVID19, our project aimed at a better understanding of how misinformation about COVID-19 flows across social media platforms. The study reported here focuses on collecting data from Telegram groups that actively promote COVID-related misinformation. The corpus collected so far contains around 28 million words from almost one million messages. Given that a substantial portion of misinformation on social media spreads via multimodal means, such as images and video, we have also developed a mechanism for utilising such content by producing automatic transcripts for videos and automatically classifying images into categories such as memes, screenshots of posts and other kinds of images. The accuracy of the image classification pipeline is around 87%.
Anthology ID:
2022.lrec-1.159
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1480–1489
URL:
https://aclanthology.org/2022.lrec-1.159
Cite (ACL):
Jose Sosa and Serge Sharoff. 2022. Multimodal Pipeline for Collection of Misinformation Data from Telegram. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1480–1489, Marseille, France. European Language Resources Association.
Cite (Informal):
Multimodal Pipeline for Collection of Misinformation Data from Telegram (Sosa & Sharoff, LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.lrec-1.159.pdf
Code:
josesosajs/telegram-data-collection
Data:
CORD-19