DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments

Remi Calizzano, Malte Ostendorff, Georg Rehm



Abstract
We present our submission to the first subtask of GermEval 2021 (classification of German Facebook comments as toxic or not). Binary sequence classification is a standard NLP task with known state-of-the-art methods. Therefore, we focus on data preparation by using two different techniques: task-specific pre-training and data augmentation. First, we pre-train multilingual transformers (XLM-RoBERTa and MT5) on 12 hate speech detection datasets in nine different languages. Task-specific pre-training yields an average improvement of 10% in terms of F1. Second, we perform data augmentation by labelling unlabelled comments, taken from Facebook, to increase the size of the training dataset by 79%. Models trained on the augmented training dataset obtain on average a +0.0282 (+5%) F1 score compared to models trained on the original training dataset. Finally, the combination of the two techniques allows us to obtain an F1 score of 0.6899 with XLM-RoBERTa and 0.6859 with MT5. The code of the project is available at: https://github.com/airKlizz/germeval2021toxic.
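As a rough illustration of the second technique described in the abstract (data augmentation by labelling unlabelled comments), a minimal pseudo-labelling sketch in Python with the Hugging Face transformers library is shown below. The checkpoint name, confidence threshold, and helper function are illustrative assumptions, not the authors' released code; their actual implementation is in the linked repository.

# Minimal pseudo-labelling sketch. Assumptions: in practice the checkpoint
# would be the task-specifically pre-trained model (here the base
# "xlm-roberta-base" is only a placeholder), and the 0.9 threshold is
# an arbitrary illustrative choice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # 0 = not toxic, 1 = toxic
)

def pseudo_label(comments, threshold=0.9):
    """Label unlabelled comments, keeping only confident predictions."""
    model.eval()
    augmented = []
    with torch.no_grad():
        for text in comments:
            inputs = tokenizer(text, truncation=True, max_length=512,
                               return_tensors="pt")
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
            confidence, label = probs.max(dim=-1)
            if confidence.item() >= threshold:
                augmented.append({"text": text, "label": int(label)})
    return augmented

# The returned examples would then be appended to the original training set.
new_examples = pseudo_label(["Example unlabelled Facebook comment."])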
Anthology ID:
2021.germeval-1.4
Volume:
Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments
Month:
September
Year:
2021
Address:
Duesseldorf, Germany
Editors:
Julian Risch, Anke Stoll, Lena Wilms, Michael Wiegand
Venue:
GermEval
Publisher:
Association for Computational Linguistics
Pages:
25–31
URL:
https://aclanthology.org/2021.germeval-1.4
Cite (ACL):
Remi Calizzano, Malte Ostendorff, and Georg Rehm. 2021. DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments. In Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, pages 25–31, Duesseldorf, Germany. Association for Computational Linguistics.
Cite (Informal):
DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments (Calizzano et al., GermEval 2021)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/2021.germeval-1.4.pdf
Code:
airklizz/germeval2021toxic