CoRAL: a Context-aware Croatian Abusive Language Dataset

Ravi Shekhar, Mladen Karan, Matthew Purver


Abstract
In light of unprecedented increases in the popularity of the internet and social media, comment moderation has never been a more relevant task. Semi-automated comment moderation systems greatly aid human moderators by either automatically classifying comments or allowing the moderators to prioritize which comments to consider first. However, the concept of inappropriate content is often subjective, and such content can be conveyed in many subtle and indirect ways. In this work, we propose CoRAL – a linguistically and culturally aware Croatian abusive language dataset covering phenomena of implicitness and reliance on local and global context. We show experimentally that current models degrade when comments are not explicit and further degrade when language skill and context knowledge are required to interpret the comment.
Anthology ID:
2022.findings-aacl.21
Volume:
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Month:
November
Year:
2022
Address:
Online only
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
217–225
URL:
https://aclanthology.org/2022.findings-aacl.21
Cite (ACL):
Ravi Shekhar, Mladen Karan, and Matthew Purver. 2022. CoRAL: a Context-aware Croatian Abusive Language Dataset. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 217–225, Online only. Association for Computational Linguistics.
Cite (Informal):
CoRAL: a Context-aware Croatian Abusive Language Dataset (Shekhar et al., Findings 2022)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2022.findings-aacl.21.pdf