TOCP: A Dataset for Chinese Profanity Processing

Hsu Yang, Chuan-Jie Lin


Abstract
This paper introduces TOCP, a larger dataset of Chinese profanity. The dataset contains natural sentences collected from social media sites, the profane expressions appearing in those sentences, and rephrasing suggestions that preserve their meaning in a less offensive way. We propose several baseline systems using neural network models to test this benchmark. We train embedding models on a profanity-related dataset and propose several profanity-related features. Our baseline systems achieve an F1-score of 86.37% in profanity detection and an accuracy of 77.32% in profanity rephrasing.
Anthology ID:
2020.trac-1.2
Volume:
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Ritesh Kumar, Atul Kr. Ojha, Bornini Lahiri, Marcos Zampieri, Shervin Malmasi, Vanessa Murdock, Daniel Kadar
Venue:
TRAC
Publisher:
European Language Resources Association (ELRA)
Pages:
6–12
Language:
English
URL:
https://aclanthology.org/2020.trac-1.2
Cite (ACL):
Hsu Yang and Chuan-Jie Lin. 2020. TOCP: A Dataset for Chinese Profanity Processing. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 6–12, Marseille, France. European Language Resources Association (ELRA).
Cite (Informal):
TOCP: A Dataset for Chinese Profanity Processing (Yang & Lin, TRAC 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2020.trac-1.2.pdf