SpamClus: An Agglomerative Clustering Algorithm for Spam Email Campaigns Detection

Daniel Díaz, Wesam Al-Nabki, Laura Fernández-Robles, Enrique Alegre, Eduardo Fidalgo, Alicia Martínez-Mendoza


Abstract
Spam emails constitute a significant proportion of the emails users receive and can result in financial losses or the download of malware onto the victim’s device. Cyberattackers create spam campaigns to deliver spam messages on a large scale, benefiting from the low cost and the anonymity of such attacks. In addition to spam filters, raising awareness about active email scams helps mitigate the consequences of spam. Detecting campaigns is therefore a relevant step in identifying and alerting the targets of spam. In this paper, we propose SpamClus, an unsupervised, iterative algorithm that groups spam emails into campaigns using agglomerative clustering. Two parameters determine the clusters: the minimum number of samples in a cluster and the minimum similarity within a cluster. Evaluating SpamClus on a set of emails provided by the Spanish National Cybersecurity Institute (INCIBE), we found that the optimal values are a minimum of 50 samples and a minimum cosine similarity of 0.8. The clustering results show 19 spam clusters with 3048 spam samples out of 6702 emails from a range of three consecutive days, and eight spam clusters with 870 spam samples out of 1469 emails from one day.
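The two parameters named in the abstract, a minimum cluster size and a minimum within-cluster cosine similarity, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the text representation, the linkage criterion, or the iteration scheme, so the whitespace bag-of-words vectors, the complete-linkage rule, and the function names `cosine` and `cluster_campaigns` below are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_campaigns(texts, min_similarity=0.8, min_samples=50):
    """Hypothetical helper: complete-linkage agglomerative clustering.

    Merging stops once no pair of clusters reaches `min_similarity`.
    Returns lists of indices into `texts`; clusters with fewer than
    `min_samples` members are discarded.
    """
    # Assumption: whitespace-tokenized bag-of-words features.
    vectors = [Counter(t.lower().split()) for t in texts]
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Complete linkage: the least similar pair decides, so every pair
        # inside a merged cluster stays at or above the threshold.
        return min(sim[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best:
                    best, best_pair = s, (x, y)
        if best < min_similarity:
            break  # no remaining pair is similar enough to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters if len(c) >= min_samples]


# Toy data; min_samples is lowered from 50 so the example stays small.
emails = [
    "win a free prize now click here",
    "win a free prize now click here today",
    "win a free prize today click here now",
    "meeting agenda for monday attached",
]
print(cluster_campaigns(emails, min_similarity=0.8, min_samples=3))
# prints [[0, 1, 2]]
```

The three near-duplicate messages form one campaign, while the unrelated message stays a singleton and is dropped by the minimum-size filter. The pairwise search is cubic in the number of emails, so this sketch only suits small inputs.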
Anthology ID:
2024.nlpaics-1.8
Volume:
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Month:
July
Year:
2024
Address:
Lancaster, UK
Editors:
Ruslan Mitkov, Saad Ezzini, Tharindu Ranasinghe, Ignatius Ezeani, Nouran Khallaf, Cengiz Acarturk, Matthew Bradbury, Mo El-Haj, Paul Rayson
Venue:
NLPAICS
Publisher:
International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Pages:
64–69
URL:
https://preview.aclanthology.org/fix-sig-urls/2024.nlpaics-1.8/
Cite (ACL):
Daniel Díaz, Wesam Al-Nabki, Laura Fernández-Robles, Enrique Alegre, Eduardo Fidalgo, and Alicia Martínez-Mendoza. 2024. SpamClus: An Agglomerative Clustering Algorithm for Spam Email Campaigns Detection. In Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security, pages 64–69, Lancaster, UK. International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security.
Cite (Informal):
SpamClus: An Agglomerative Clustering Algorithm for Spam Email Campaigns Detection (Díaz et al., NLPAICS 2024)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2024.nlpaics-1.8.pdf