A Novel Corpus for Automated Sexism Identification on Social Media

Lutfiye Seda Mut Altin, Horacio Saggion


Abstract
In this paper, we present a novel dataset for the study of automated sexism identification and categorization on social media in Turkish. To build it, we collected a set of tweets and YouTube comments following a well-established methodology. Relying on expert organizations in the area of gender equality, we annotated each text according to a two-level labelling schema derived from previous research. The resulting dataset consists of around 7,000 annotated instances useful for the study of expressions of sexism and misogyny on the Web. To the best of our knowledge, this is the first comprehensive, manually annotated Turkish dataset with two-level labels for sexism identification. To fuel research in this area, we also present the results of our benchmarking experiments on sexism identification in Turkish.
Anthology ID:
2024.sigul-1.2
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
Publisher:
ELRA and ICCL
Pages:
10–15
URL:
https://aclanthology.org/2024.sigul-1.2
Cite (ACL):
Lutfiye Seda Mut Altin and Horacio Saggion. 2024. A Novel Corpus for Automated Sexism Identification on Social Media. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 10–15, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Novel Corpus for Automated Sexism Identification on Social Media (Mut Altin & Saggion, SIGUL-WS 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.sigul-1.2.pdf