Abstract
Work on Abusive Language Detection has tackled a wide range of subtasks and domains. As a result of this, there exists a great deal of redundancy and non-generalisability between datasets. Through experiments on cross-dataset training and testing, the paper reveals that the preconceived notion of including more non-abusive samples in a dataset (to emulate reality) may have a detrimental effect on the generalisability of a model trained on that data. Hence a hierarchical annotation model is utilised here to reveal redundancies in existing datasets and to help reduce redundancy in future efforts.- Anthology ID:
- K19-1088
- Volume:
- Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong, China
- Editors:
- Mohit Bansal, Aline Villavicencio
- Venue:
- CoNLL
- SIG:
- SIGNLL
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 940–950
- Language:
- URL:
- https://aclanthology.org/K19-1088
- DOI:
- 10.18653/v1/K19-1088
- Cite (ACL):
- Steve Durairaj Swamy, Anupam Jamatia, and Björn Gambäck. 2019. Studying Generalisability across Abusive Language Detection Datasets. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 940–950, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal):
- Studying Generalisability across Abusive Language Detection Datasets (Swamy et al., CoNLL 2019)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/K19-1088.pdf