A Dataset of Offensive Language in Kosovo Social Media

Adem Ajvazi, Christian Hardmeier


Abstract
Social media are a central part of people’s lives. Unfortunately, many public social media spaces are rife with bullying and offensive language, creating an unsafe environment for their users. In this paper, we present a new dataset for offensive language detection in Albanian. The dataset is composed of user-generated comments on Facebook and YouTube from the channels of selected Kosovo news platforms. It is annotated according to the three levels of the OLID annotation scheme. We also show results of a baseline system for offensive language classification based on a fine-tuned BERT model and compare with the Danish DKhate dataset, which is similar in scope and size. In a transfer learning setting, we find that merging the Albanian and Danish training sets leads to improved performance for prediction on Danish, but not Albanian, on both offensive language recognition and distinguishing targeted and untargeted offence.
Anthology ID:
2022.lrec-1.198
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1860–1869
URL:
https://aclanthology.org/2022.lrec-1.198
Cite (ACL):
Adem Ajvazi and Christian Hardmeier. 2022. A Dataset of Offensive Language in Kosovo Social Media. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1860–1869, Marseille, France. European Language Resources Association.
Cite (Informal):
A Dataset of Offensive Language in Kosovo Social Media (Ajvazi & Hardmeier, LREC 2022)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2022.lrec-1.198.pdf
Data:
DKhate, OLID