Data Augmentation using Machine Translation for Fake News Detection in the Urdu Language

Maaz Amjad, Grigori Sidorov, Alisa Zhila


Abstract
The task of fake news detection is to distinguish legitimate news articles that describe real facts from those that convey deceptive and fictitious information. As the fake news phenomenon is present across all languages, it is crucial to be able to solve this problem efficiently for languages other than English. A common approach to this task is supervised classification using features of varying complexity. Yet supervised machine learning requires a substantial amount of annotated data. For English and a small number of other languages, annotated data is far more readily available, whereas for the vast majority of languages it is scarce. We investigate whether machine translation in its present state can be successfully used as an automated technique for creating and augmenting annotated corpora for fake news detection, focusing on the English-Urdu language pair. We train a fake news classifier for Urdu on (1) a manually annotated dataset originally in Urdu and (2) the machine-translated version of an existing annotated fake news dataset originally in English. We show that, at the present level of machine translation quality for the English-Urdu language pair, fully automated data augmentation through machine translation does not improve fake news detection in Urdu.
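The data setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `Example`, `translate_dataset`, and `build_training_sets` are hypothetical, the `translate` argument stands in for whatever English-to-Urdu machine translation system is used, and the combined "augmented" set is an assumption about how the two sources might be merged.

```python
from typing import Callable, Dict, List, Tuple

# One annotated article: (text, label), where label is "fake" or "legit".
Example = Tuple[str, str]


def translate_dataset(
    english: List[Example], translate: Callable[[str], str]
) -> List[Example]:
    """Translate each article's text; the annotation is carried over unchanged."""
    return [(translate(text), label) for text, label in english]


def build_training_sets(
    urdu: List[Example],
    english: List[Example],
    translate: Callable[[str], str],
) -> Dict[str, List[Example]]:
    """Assemble the training sets to compare.

    "original"   : manually annotated Urdu data only
    "translated" : machine-translated English data only
    "augmented"  : both combined (an assumed merging strategy)
    """
    translated = translate_dataset(english, translate)
    return {
        "original": list(urdu),
        "translated": translated,
        "augmented": list(urdu) + translated,
    }


def stub_translate(text: str) -> str:
    """Placeholder translator used only for illustration; it does not translate."""
    return "[ur] " + text


# Toy placeholder data, not taken from any real corpus.
urdu_data = [("urdu article a", "fake"), ("urdu article b", "legit")]
english_data = [("english article c", "fake")]

training_sets = build_training_sets(urdu_data, english_data, stub_translate)
```

A classifier trained on each of these sets and evaluated on the same held-out Urdu test data would then show whether the translated examples help; the abstract reports that, for this language pair, they did not.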
Anthology ID:
2020.lrec-1.309
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2537–2542
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.309
Cite (ACL):
Maaz Amjad, Grigori Sidorov, and Alisa Zhila. 2020. Data Augmentation using Machine Translation for Fake News Detection in the Urdu Language. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2537–2542, Marseille, France. European Language Resources Association.
Cite (Informal):
Data Augmentation using Machine Translation for Fake News Detection in the Urdu Language (Amjad et al., LREC 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.lrec-1.309.pdf