Heejun Lee


2022

K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment
Jean Lee | Taejun Lim | Heejun Lee | Bogeun Jo | Yangsok Kim | Heegeun Yoon | Soyeon Caren Han
Proceedings of the 29th International Conference on Computational Linguistics

Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments, each annotated with one to four labels, which allows it to handle subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms the others, recognizing decomposed characters in each hate speech class.
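
To make "decomposed characters" concrete, the following is a minimal sketch of how a precomposed Hangul syllable can be split into its constituent jamo using the standard Unicode arithmetic. It is not the KR-BERT tokenizer and is not taken from the paper; the function name `decompose_hangul` is hypothetical, and a real sub-character tokenizer would apply a learned subword vocabulary on top of a step like this.

```python
import unicodedata
from typing import List

# Constants from the Unicode Hangul syllable composition scheme.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 precomposed syllables


def decompose_hangul(text: str) -> List[str]:
    """Split each precomposed Hangul syllable into conjoining jamo.

    Hypothetical helper for illustration only; characters outside the
    Hangul syllable block are passed through unchanged.
    """
    out: List[str] = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            lead = idx // (V_COUNT * T_COUNT)
            vowel = (idx % (V_COUNT * T_COUNT)) // T_COUNT
            tail = idx % T_COUNT
            out.append(chr(L_BASE + lead))
            out.append(chr(V_BASE + vowel))
            if tail:  # tail index 0 means "no final consonant"
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)
    return out


# The result matches Unicode canonical decomposition (NFD).
jamo = decompose_hangul("한글")
assert "".join(jamo) == unicodedata.normalize("NFD", "한글")
```

Working at the jamo level means a model can still share pieces between syllables that differ in only one component, which is one plausible reason a sub-character vocabulary helps on noisy comment text.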