DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments

Rémi Calizzano; Malte Ostendorff; Georg Rehm

Abstract

We present our submission to the first subtask of GermEval 2021: the classification of German Facebook comments as toxic or not toxic. Binary sequence classification is a standard NLP task with well-established state-of-the-art methods, so we focus on data preparation, using two techniques: task-specific pre-training and data augmentation. First, we pre-train multilingual transformers (XLM-RoBERTa and MT5) on 12 hate speech detection datasets in nine different languages. Task-specific pre-training improves F1 by 10% on average. Second, we perform data augmentation by labelling unlabelled comments taken from Facebook, which increases the size of the training dataset by 79%. Models trained on the augmented training dataset gain on average 0.0282 F1 (+5%) over models trained on the original training dataset. Finally, combining the two techniques yields an F1 score of 0.6899 with XLM-RoBERTa and 0.6859 with MT5. The code of the project is available at: https://github.com/airKlizz/germeval2021toxic.
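The data augmentation step can be illustrated with a minimal sketch. The abstract does not say how the unlabelled comments were labelled, so everything below is an assumption made for illustration: a confidence-thresholded pseudo-labelling scheme, the hypothetical names `augment_with_pseudo_labels`, `growth_percent` and `predict_proba`, and the threshold value of 0.9. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple

# (comment text, label) where 1 = toxic and 0 = not toxic.
Example = Tuple[str, int]


def augment_with_pseudo_labels(
    labelled: List[Example],
    unlabelled: List[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.9,
) -> List[Example]:
    """Return the labelled data plus confidently pseudo-labelled comments.

    predict_proba is a hypothetical scorer returning P(toxic) for a comment.
    The threshold is an assumed value, not one reported in the abstract.
    """
    augmented = list(labelled)
    for text in unlabelled:
        p_toxic = predict_proba(text)
        if p_toxic >= threshold:
            augmented.append((text, 1))
        elif p_toxic <= 1.0 - threshold:
            augmented.append((text, 0))
        # Comments the scorer is unsure about are left out.
    return augmented


def growth_percent(original: List[Example], augmented: List[Example]) -> float:
    """Relative growth of the training set, in percent."""
    return (len(augmented) - len(original)) / len(original) * 100.0
```

With a scorer standing in for a trained classifier, `growth_percent(labelled, augment_with_pseudo_labels(labelled, unlabelled, scorer))` gives the kind of relative size increase that the abstract reports as 79%.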

Anthology ID: 2021.germeval-1.4
Volume: Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments
Month: September
Year: 2021
Address: Duesseldorf, Germany
Editors: Julian Risch, Anke Stoll, Lena Wilms, Michael Wiegand
Venue: GermEval
Publisher: Association for Computational Linguistics
Pages: 25–31
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.germeval-1.4/
Cite (ACL): Remi Calizzano, Malte Ostendorff, and Georg Rehm. 2021. DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments. In Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, pages 25–31, Duesseldorf, Germany. Association for Computational Linguistics.
Cite (Informal): DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments (Calizzano et al., GermEval 2021)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.germeval-1.4.pdf
Code: airklizz/germeval2021toxic
