“Zo Grof !”: A Comprehensive Corpus for Offensive and Abusive Language in Dutch
Ward Ruitenbeek, Victor Zwart, Robin Van Der Noord, Zhenja Gnezdilov, Tommaso Caselli
Abstract
This paper presents a comprehensive corpus for the study of socially unacceptable language in Dutch. The corpus extends and revises an existing resource with more data and introduces a new annotation dimension for offensive language, making it a unique resource in the Dutch language panorama. Each language phenomenon (abusive and offensive language) in the corpus has been annotated with a multi-layer annotation scheme modelling the explicitness and the target(s) of the message. We have conducted a new set of experiments with different classification algorithms on all annotation dimensions. Monolingual Pre-Trained Language Models prove to be the best systems, obtaining a macro-average F1 of 0.828 for binary classification of offensive language, and 0.579 for the targets of offensive messages. Furthermore, the best system obtains a macro-average F1 of 0.667 for distinguishing between abusive and offensive messages.
- Anthology ID:
- 2022.woah-1.5
- Volume:
- Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)
- Month:
- July
- Year:
- 2022
- Address:
- Seattle, Washington (Hybrid)
- Venue:
- WOAH
- Publisher:
- Association for Computational Linguistics
- Pages:
- 40–56
- URL:
- https://aclanthology.org/2022.woah-1.5
- DOI:
- 10.18653/v1/2022.woah-1.5
- Cite (ACL):
- Ward Ruitenbeek, Victor Zwart, Robin Van Der Noord, Zhenja Gnezdilov, and Tommaso Caselli. 2022. “Zo Grof !”: A Comprehensive Corpus for Offensive and Abusive Language in Dutch. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), pages 40–56, Seattle, Washington (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- “Zo Grof !”: A Comprehensive Corpus for Offensive and Abusive Language in Dutch (Ruitenbeek et al., WOAH 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.woah-1.5.pdf
- Code:
- tommasoc80/dalc