Chun Man Tsang
2024
Deciphering Cyber Threats: A Unifying Framework with GPT-3.5, BERTopic and Feature Importance
Chun Man Tsang | Tom Bell | Antonios Gouglidis | Mo El-Haj
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security
This paper presents a methodology for categorising cyber threats and quantifying their attributes. The data was sourced from Common Weakness Enumeration (CWE) entries covering 503 hardware and software vulnerabilities. For each entry, GPT-3.5 generated detailed descriptions of 12 key threat attributes. Using BERTopic for topic modelling, the study clusters cyber threats and evaluates several dimensionality reduction and clustering algorithms, finding that UMAP combined with HDBSCAN, once its parameters are tuned, outperforms the other configurations. The study then recasts the topic modelling results as a classification problem to support feature importance analysis, achieving classification accuracies between 60% and 80% with algorithms such as Random Forest, XGBoost, and Linear SVM. This feature importance analysis quantifies the significance of each threat attribute, with SHAP identified as the most effective method for this calculation.
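The abstract names SHAP as the method used to quantify attribute importance. SHAP attributions are built on Shapley values, so a small worked example of that underlying quantity may help make the idea concrete. The sketch below is not the paper's pipeline and does not use the SHAP library: it computes exact Shapley values by enumerating feature subsets for a hypothetical scoring function. The function `toy_model`, its three numeric inputs, and the all-zero baseline are assumptions made purely for illustration; the paper's attributes are GPT-3.5 text descriptions and its models are trained classifiers.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values for one instance, relative to a baseline input.

    Enumerates every feature subset, so it is only practical for a handful
    of features.
    """
    n = len(instance)
    features = range(n)

    def value(subset):
        # Features in `subset` take the instance's value; the rest stay at baseline.
        x = [instance[i] if i in subset else baseline[i] for i in features]
        return model(x)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical scoring function over three made-up attribute features
# (an assumption for illustration, not a model from the paper).
def toy_model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

In this toy case the attributions sum to the difference between the model output on the instance and on the baseline, and the interaction term between the first and third features is split equally between them. Libraries such as SHAP approximate or specialise this computation for real models, where full enumeration is too expensive.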