Studying Generalisability across Abusive Language Detection Datasets

Steve Durairaj Swamy, Anupam Jamatia, Björn Gambäck


Abstract
Work on Abusive Language Detection has tackled a wide range of subtasks and domains. As a result, there is a great deal of redundancy and non-generalisability between datasets. Through experiments on cross-dataset training and testing, the paper reveals that the preconceived notion of including more non-abusive samples in a dataset (to emulate reality) may have a detrimental effect on the generalisability of a model trained on that data. Hence, a hierarchical annotation model is utilised here to reveal redundancies in existing datasets and to help reduce redundancy in future efforts.
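The cross-dataset training and testing described in the abstract can be illustrated with a minimal sketch: train a classifier on one corpus and evaluate it on every other corpus, so that the off-diagonal scores expose how well (or poorly) models transfer. The sketch below is not the authors' setup; the dataset file names, the "text"/"label" column layout, and the TF-IDF plus logistic-regression model are assumptions made for illustration only.

```python
# Minimal sketch of a cross-dataset train/test grid for abusive language
# detection (not the paper's exact models or corpora).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def load(path):
    """Load a CSV with 'text' and binary 'label' columns (assumed schema)."""
    df = pd.read_csv(path)
    return df["text"].astype(str), df["label"]


# Hypothetical local copies of publicly released abuse corpora.
datasets = {
    "waseem": "waseem.csv",
    "davidson": "davidson.csv",
    "founta": "founta.csv",
}

# Train on each dataset and test on every dataset; diagonal entries give
# in-dataset performance, off-diagonal entries measure generalisability.
for train_name, train_path in datasets.items():
    X_train, y_train = load(train_path)
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    for test_name, test_path in datasets.items():
        X_test, y_test = load(test_path)
        score = f1_score(y_test, model.predict(X_test), average="macro")
        print(f"train={train_name:10s} test={test_name:10s} macro-F1={score:.3f}")
```

Macro-averaged F1 is used in the sketch because abuse datasets are typically imbalanced, so accuracy alone would mask poor transfer on the minority (abusive) class.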
Anthology ID:
K19-1088
Volume:
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Mohit Bansal, Aline Villavicencio
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
940–950
URL:
https://aclanthology.org/K19-1088
DOI:
10.18653/v1/K19-1088
Cite (ACL):
Steve Durairaj Swamy, Anupam Jamatia, and Björn Gambäck. 2019. Studying Generalisability across Abusive Language Detection Datasets. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 940–950, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Studying Generalisability across Abusive Language Detection Datasets (Swamy et al., CoNLL 2019)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/K19-1088.pdf