2024
pdf
bib
abs
Comparative Analysis of Natural Language Processing Models for Malware Spam Email Identification
Francisco Jáñez-Martino
|
Eduardo Fidalgo
|
Rocío Alaiz-Rodríguez
|
Andrés Carofilis
|
Alicia Martínez-Mendoza
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Spam email is one of the main vectors of cyberattacks containing scams and spreading malware. Spam emails can contain malicious and external links and attachments with hidden malicious code. Hence, cybersecurity experts seek to detect this type of email to provide earlier and more detailed warnings for organizations and users. This work is based on a binary classification system (with and without malware) and evaluates models that have achieved high performance in other natural language applications, such as fastText, BERT, RoBERTa, DistilBERT, XLM-RoBERTa, and Large Language Models such as LLaMA and Mistral. Using the Spam Email Malware Detection (SEMD-600) dataset, we compare these models regarding precision, recall, F1 score, accuracy, and runtime. DistilBERT emerges as the most suitable option, achieving a recall of 0.792 and a runtime of 1.612 ms per email.
pdf
bib
abs
SpamClus: An Agglomerative Clustering Algorithm for Spam Email Campaigns Detection
Daniel Díaz
|
Wesam Al-Nabki
|
Laura Fernández-Robles
|
Enrique Alegre
|
Eduardo Fidalgo
|
Alicia Martínez-Mendoza
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
Spam emails constitute a significant proportion of emails received by users, and can result in financial losses or in the download of malware on the victim’s device. Cyberattackers create spam campaigns to deliver spam messages on a large scale and benefit from the low economic investment and anonymity required to create the attacks. In addition to spam filters, raising awareness about active email scams is a relevant measure that helps mitigate the consequences of spam. Therefore, detecting campaigns becomes a relevant task in identifying and alerting the targets of spam. In this paper, we propose an unsupervised learning algorithm, SpamClus_1, an iterative algorithm that groups spam email campaigns using agglomerative clustering. The measures employed to determine the clusters are the minimum number of samples and minimum percentage of similarity within a cluster. Evaluating SpamClus_1 on a set of emails provided by the Spanish National Cybersecurity Institute (INCIBE), we found that the optimal values are 50 minimum samples and a minimum cosine similarity of 0.8. The clustering results show 19 spam datasets with 3048 spam samples out of 6702 emails from a range of three consecutive days and eight spam clusters with 870 spam samples out of 1469 emails from one day.
pdf
bib
abs
CECILIA: Enhancing CSIRT Effectiveness with Transformer-Based Cyber Incident Classification
Juan Jose Delgado Sotes
|
Alicia Martinez Mendoza
|
Andres Carofilis Vasco
|
Eduardo Fidalgo Fernandez
|
Enrique Alegre Gutierrez
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
This paper introduces an approach to improv ing incident response times by applying various Artificial Intelligence (AI) classification algorithms based on transformers to analyze the efficacy of these models in categorizing cyber incidents. As a first contribution, we developed a cyber incident dataset, CECILIA-10C-900, collecting cyber incident reports from six qualified web sources. The contribution of creating a dataset on cyber incident detection is remarkable due to the scarcity of such datasets. Each incident has been tagged by hand according to the cyber incident taxonomy defined by the CERT (Computer Emergency Response Team) of the National Institute of Cybersecurity (INCIBE). This dataset is highly unbalanced, so we decided to unify the four least represented classes under the label “others”, leaving a dataset with six categories (CECILIA-6C-900). With these reliable datasets, we performed a comparison of the best algorithms specifically for the cyber incident classification problem, evaluating eight different metrics on two conventional classifiers and six other transformer-based classifiers. Our study highlights the importance of having a rapid classification mechanism for CSIRTs (Computer Security Incident Response Teams) and showcases the potential of machine learning algorithms to improve cyber defense mechanisms. The findings from our analysis provide valuable insights into the strengths and limitations of different classification techniques. It can be used in future work on cyber incident response strategies