Comparative Analysis of Natural Language Processing Models for Malware Spam Email Identification
Francisco Jáñez-Martino, Eduardo Fidalgo, Rocío Alaiz-Rodríguez, Andrés Carofilis, Alicia Martínez-Mendoza
Abstract
Spam email is one of the main vectors for cyberattacks, carrying scams and spreading malware. Spam emails can contain malicious external links and attachments with hidden malicious code. Hence, cybersecurity experts seek to detect this type of email to provide earlier and more detailed warnings for organizations and users. This work is based on a binary classification system (with and without malware) and evaluates models that have achieved high performance in other natural language applications, such as fastText, BERT, RoBERTa, DistilBERT, XLM-RoBERTa, and Large Language Models such as LLaMA and Mistral. Using the Spam Email Malware Detection (SEMD-600) dataset, we compare these models in terms of precision, recall, F1 score, accuracy, and runtime. DistilBERT emerges as the most suitable option, achieving a recall of 0.792 and a runtime of 1.612 ms per email.
- Anthology ID:
- 2024.nlpaics-1.7
- Volume:
- Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
- Month:
- July
- Year:
- 2024
- Address:
- Lancaster, UK
- Editors:
- Ruslan Mitkov, Saad Ezzini, Tharindu Ranasinghe, Ignatius Ezeani, Nouran Khallaf, Cengiz Acarturk, Matthew Bradbury, Mo El-Haj, Paul Rayson
- Venue:
- NLPAICS
- Publisher:
- International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
- Pages:
- 59–63
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2024.nlpaics-1.7/
- Cite (ACL):
- Francisco Jáñez-Martino, Eduardo Fidalgo, Rocío Alaiz-Rodríguez, Andrés Carofilis, and Alicia Martínez-Mendoza. 2024. Comparative Analysis of Natural Language Processing Models for Malware Spam Email Identification. In Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security, pages 59–63, Lancaster, UK. International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security.
- Cite (Informal):
- Comparative Analysis of Natural Language Processing Models for Malware Spam Email Identification (Jáñez-Martino et al., NLPAICS 2024)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2024.nlpaics-1.7.pdf
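
The abstract compares models on precision, recall, F1 score, accuracy, and runtime per email for a binary task (with and without malware). As a rough illustration of how such figures are typically computed, the sketch below derives them from true and predicted labels and times a classifier per email. It is not the authors' evaluation code: the function names, the label convention (1 = malware, 0 = no malware), and the keyword-based `toy_classifier` are all hypothetical stand-ins for a real model.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy, treating label 1 (malware) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def timed_predictions(classify: Callable[[str], int],
                      emails: Sequence[str]) -> Tuple[List[int], float]:
    """Run `classify` on each email; return predictions and mean runtime in ms per email."""
    start = time.perf_counter()
    predictions = [classify(email) for email in emails]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return predictions, (elapsed_ms / len(emails) if emails else 0.0)


def toy_classifier(email: str) -> int:
    """Hypothetical placeholder for a real model: flags emails mentioning an .exe file."""
    return 1 if ".exe" in email.lower() else 0


if __name__ == "__main__":
    # Made-up examples for illustration only; not taken from SEMD-600.
    emails = ["Open invoice.exe now", "Meeting at noon", "Your parcel: track.EXE"]
    labels = [1, 0, 1]
    preds, ms_per_email = timed_predictions(toy_classifier, emails)
    print(binary_metrics(labels, preds))
    print(f"mean runtime: {ms_per_email:.3f} ms per email")
```

Recall is the fraction of malware emails that the classifier actually flags, which is why it is a natural headline metric when a missed malicious email is the costlier error.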