DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments

Remi Calizzano, Malte Ostendorff, Georg Rehm



Abstract
We present our submission to the first subtask of GermEval 2021 (classification of German Facebook comments as toxic or not). Binary sequence classification is a standard NLP task with known state-of-the-art methods. Therefore, we focus on data preparation by using two different techniques: task-specific pre-training and data augmentation. First, we pre-train multilingual transformers (XLM-RoBERTa and MT5) on 12 hate speech detection datasets in nine different languages. Task-specific pre-training yields an average improvement of 10% in terms of F1. Second, we perform data augmentation by labelling unlabelled comments, taken from Facebook, to increase the size of the training dataset by 79%. Models trained on the augmented training dataset obtain on average a +0.0282 (+5%) F1 score compared to models trained on the original training dataset. Finally, the combination of the two techniques allows us to obtain an F1 score of 0.6899 with XLM-RoBERTa and 0.6859 with MT5. The code of the project is available at: https://github.com/airKlizz/germeval2021toxic.
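As a rough illustration of the second technique described in the abstract (data augmentation by labelling unlabelled comments), a minimal pseudo-labelling sketch in Python with the Hugging Face transformers library is shown below. The checkpoint name, confidence threshold, and helper function are illustrative assumptions, not the authors' released code; their actual implementation is in the linked repository.

# Minimal pseudo-labelling sketch. Assumptions: in practice the checkpoint
# would be the task-specifically pre-trained model (here the base
# "xlm-roberta-base" is only a placeholder), and the 0.9 threshold is
# an arbitrary illustrative choice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # 0 = not toxic, 1 = toxic
)

def pseudo_label(comments, threshold=0.9):
    """Label unlabelled comments, keeping only confident predictions."""
    model.eval()
    augmented = []
    with torch.no_grad():
        for text in comments:
            inputs = tokenizer(text, truncation=True, max_length=512,
                               return_tensors="pt")
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
            confidence, label = probs.max(dim=-1)
            if confidence.item() >= threshold:
                augmented.append({"text": text, "label": int(label)})
    return augmented

# The returned examples would then be appended to the original training set.
new_examples = pseudo_label(["Example unlabelled Facebook comment."])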
Anthology ID:
2021.germeval-1.4
Volume:
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments
Month:
September
Year:
2021
Address:
Duesseldorf, Germany
Editors:
Julian Risch, Anke Stoll, Lena Wilms, Michael Wiegand
Venue:
GermEval
Publisher:
Association for Computational Linguistics
Pages:
25–31
URL:
https://aclanthology.org/2021.germeval-1.4
Cite (ACL):
Remi Calizzano, Malte Ostendorff, and Georg Rehm. 2021. DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments. In Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, pages 25–31, Duesseldorf, Germany. Association for Computational Linguistics.
Cite (Informal):
DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments (Calizzano et al., GermEval 2021)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/2021.germeval-1.4.pdf
Code:
airklizz/germeval2021toxic