Abstract
In this paper, we address the problem of filtering obscene lexis in Russian texts. We use string similarity measures to find words similar or identical to words from a stop list and establish both a test collection and a baseline for the task. Our experiments show that a novel string similarity measure based on the notion of an annotated suffix tree outperforms some of the other well-known measures.
- Anthology ID:
- W17-1415
- Volume:
- Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
- Month:
- April
- Year:
- 2017
- Address:
- Valencia, Spain
- Editors:
- Tomaž Erjavec, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
- Venue:
- BSNLP
- SIG:
- SIGSLAV
- Publisher:
- Association for Computational Linguistics
- Pages:
- 97–101
- URL:
- https://aclanthology.org/W17-1415
- DOI:
- 10.18653/v1/W17-1415
- Cite (ACL):
- Ekaterina Chernyak. 2017. Comparison of String Similarity Measures for Obscenity Filtering. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 97–101, Valencia, Spain. Association for Computational Linguistics.
- Cite (Informal):
- Comparison of String Similarity Measures for Obscenity Filtering (Chernyak, BSNLP 2017)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/W17-1415.pdf