Enrique Alegre

2024

pdf bib abs
SpamClus: An Agglomerative Clustering Algorithm for Spam Email Campaigns Detection
Daniel Díaz | Wesam Al-Nabki | Laura Fernández-Robles | Enrique Alegre | Eduardo Fidalgo | Alicia Martínez-Mendoza
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security

Spam emails constitute a significant proportion of emails received by users, and can result in financial losses or in the download of malware on the victim’s device. Cyberattackers create spam campaigns to deliver spam messages on a large scale and benefit from the low economic investment and anonymity required to create the attacks. In addition to spam filters, raising awareness about active email scams is a relevant measure that helps mitigate the consequences of spam. Therefore, detecting campaigns becomes a relevant task in identifying and alerting the targets of spam. In this paper, we propose an unsupervised learning algorithm, SpamClus_1, an iterative algorithm that groups spam email campaigns using agglomerative clustering. The measures employed to determine the clusters are the minimum number of samples and minimum percentage of similarity within a cluster. Evaluating SpamClus_1 on a set of emails provided by the Spanish National Cybersecurity Institute (INCIBE), we found that the optimal values are 50 minimum samples and a minimum cosine similarity of 0.8. The clustering results show 19 spam datasets with 3048 spam samples out of 6702 emails from a range of three consecutive days and eight spam clusters with 870 spam samples out of 1469 emails from one day.

2017

pdf bib abs
Classifying Illegal Activities on Tor Network Based on Web Textual Contents
Mhd Wesam Al Nabki | Eduardo Fidalgo | Enrique Alegre | Ivan de Paz
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The freedom of the Deep Web offers a safe place where people can express themselves anonymously but they also can conduct illegal activities. In this paper, we present and make publicly available a new dataset for Darknet active domains, which we call ”Darknet Usage Text Addresses” (DUTA). We built DUTA by sampling the Tor network during two months and manually labeled each address into 26 classes. Using DUTA, we conducted a comparison between two well-known text representation techniques crossed by three different supervised classifiers to categorize the Tor hidden services. We also fixed the pipeline elements and identified the aspects that have a critical influence on the classification results. We found that the combination of TFIDF words representation with Logistic Regression classifier achieves 96.6% of 10 folds cross-validation accuracy and a macro F1 score of 93.7% when classifying a subset of illegal activities from DUTA. The good performance of the classifier might support potential tools to help the authorities in the detection of these activities.

Co-authors

Alicia Martínez-Mendoza 1

Ivan de Paz 1

Venues

eacl1
nlpaics1

Fix data