Abstract
We present our submission to the first subtask of GermEval 2021, the classification of German Facebook comments as toxic or not toxic. Binary sequence classification is a standard NLP task with well-established state-of-the-art methods, so we focus on data preparation using two techniques: task-specific pre-training and data augmentation. First, we pre-train multilingual transformers (XLM-RoBERTa and MT5) on 12 hate speech detection datasets in nine different languages. Task-specific pre-training improves F1 by 10% on average. Second, we perform data augmentation by labelling unlabelled Facebook comments, which increases the size of the training dataset by 79%. Models trained on the augmented training dataset gain on average +0.0282 (+5%) in F1 score compared to models trained on the original training dataset. Finally, combining the two techniques yields an F1 score of 0.6899 with XLM-RoBERTa and 0.6859 with MT5. The code of the project is available at: https://github.com/airKlizz/germeval2021toxic.
- Anthology ID:
- 2021.germeval-1.4
- Volume:
- Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments
- Month:
- September
- Year:
- 2021
- Address:
- Duesseldorf, Germany
- Editors:
- Julian Risch, Anke Stoll, Lena Wilms, Michael Wiegand
- Venue:
- GermEval
- Publisher:
- Association for Computational Linguistics
- Pages:
- 25–31
- URL:
- https://aclanthology.org/2021.germeval-1.4
- Cite (ACL):
- Remi Calizzano, Malte Ostendorff, and Georg Rehm. 2021. DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments. In Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, pages 25–31, Duesseldorf, Germany. Association for Computational Linguistics.
- Cite (Informal):
- DFKI SLT at GermEval 2021: Multilingual Pre-training and Data Augmentation for the Classification of Toxicity in Social Media Comments (Calizzano et al., GermEval 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2021.germeval-1.4.pdf
- Code:
- airklizz/germeval2021toxic
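
The data augmentation step described in the abstract (labelling unlabelled comments and adding them to the training set) can be illustrated with a minimal sketch. The abstract does not say how the labels were produced, so this sketch assumes a classifier that returns a toxicity probability and keeps only confident predictions. The names `augment_with_pseudo_labels`, `relative_growth`, `toy_scorer`, and the `threshold` value are all hypothetical and are not taken from the project's code.

```python
from typing import Callable, List, Tuple

# (comment text, label) where 1 = toxic and 0 = not toxic
Example = Tuple[str, int]


def augment_with_pseudo_labels(
    labelled: List[Example],
    unlabelled: List[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.9,
) -> List[Example]:
    """Return the labelled set plus confidently pseudo-labelled comments.

    Assumption: predict_proba(text) returns the probability that the
    comment is toxic. Comments the scorer is unsure about are skipped.
    """
    augmented = list(labelled)
    for text in unlabelled:
        p_toxic = predict_proba(text)
        if p_toxic >= threshold:
            augmented.append((text, 1))
        elif p_toxic <= 1.0 - threshold:
            augmented.append((text, 0))
    return augmented


def relative_growth(original_size: int, augmented_size: int) -> float:
    """Relative increase in dataset size, e.g. 0.79 for a 79% increase."""
    return (augmented_size - original_size) / original_size


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a trained classifier (keyword lookup only)."""
    lowered = text.lower()
    if "idiot" in lowered:
        return 0.95
    if "danke" in lowered:
        return 0.05
    return 0.5


labelled = [("Du Idiot!", 1), ("Guter Punkt.", 0)]
unlabelled = [
    "Du bist ein Idiot",       # confidently toxic under the toy scorer
    "Danke für den Beitrag",   # confidently not toxic under the toy scorer
    "Was meinst du damit?",    # uncertain, so it is left out
]

augmented = augment_with_pseudo_labels(labelled, unlabelled, toy_scorer)
growth = relative_growth(len(labelled), len(augmented))
print(len(augmented), growth)
```

In practice `toy_scorer` would be replaced by a trained model, and the threshold trades the number of added examples against label noise.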