Comparison of String Similarity Measures for Obscenity Filtering

Ekaterina Chernyak


Abstract
In this paper we address the problem of filtering obscene lexis in Russian texts. We use string similarity measures to find words similar or identical to words from a stop list, and we establish both a test collection and a baseline for the task. Our experiments show that a novel string similarity measure based on the notion of an annotated suffix tree outperforms some of the other well-known measures.
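The general idea of matching tokens against a stop list by string similarity can be illustrated with a minimal sketch. It is not the paper's method: it uses normalized Levenshtein distance as a stand-in for one of the well-known measures, not the annotated suffix tree measure. The function names, the placeholder stop-list entry, and the 0.75 threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def flag_words(tokens, stop_list, threshold=0.75):
    """Return tokens whose similarity to some stop-list word reaches the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return [t for t in tokens
            if any(similarity(t.lower(), s) >= threshold for s in stop_list)]


STOP_LIST = ["badword"]  # hypothetical placeholder entry
tokens = ["a", "baadword", "b4dword", "keyboard", "badword"]
flagged = flag_words(tokens, STOP_LIST)
print(flagged)  # ['baadword', 'b4dword', 'badword']
```

Here the distorted spellings are caught because each is one edit away from the stop-list entry, while an unrelated word of similar length falls below the threshold. A real filter would need a threshold tuned on a test collection such as the one the abstract describes.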
Anthology ID:
W17-1415
Volume:
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Tomaž Erjavec, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
Venue:
BSNLP
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
97–101
URL:
https://aclanthology.org/W17-1415
DOI:
10.18653/v1/W17-1415
Cite (ACL):
Ekaterina Chernyak. 2017. Comparison of String Similarity Measures for Obscenity Filtering. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 97–101, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Comparison of String Similarity Measures for Obscenity Filtering (Chernyak, BSNLP 2017)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/W17-1415.pdf