Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Yao-Ching Yu | Tsun-Han Chiang | Cheng-Wei Tsai | Chien-Ming Huang | Wen-Kwang Tsao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, we have noticed a scarcity of open-source datasets, particularly high-quality pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages: pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continued pretraining on our dataset yields a **15.9%** improvement in aggregate score, while reasoning distillation leads to a **15.8%** gain on the security-certification benchmark (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community.
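Since the abstract states the datasets will be released openly, a natural way to consume them is via the Hugging Face `datasets` library. The repository id and field name below are hypothetical placeholders (the abstract does not name the released artifacts); this is a minimal sketch, not a confirmed usage of the paper's release:

```python
# Minimal sketch of loading one of the released Primus datasets.
# "example-org/primus-pretraining" and the "text" field are
# hypothetical placeholders -- substitute the ids actually published
# with the paper's release.
from datasets import load_dataset

# Stream the corpus so the full pretraining set is not downloaded up front.
corpus = load_dataset(
    "example-org/primus-pretraining",  # hypothetical repo id
    split="train",
    streaming=True,
)

# Peek at the first few documents.
for i, example in enumerate(corpus):
    print(example["text"][:200])  # assumes a "text" field per record
    if i >= 2:
        break
```

Streaming is a reasonable default for a pretraining-scale corpus, since such datasets are typically too large to materialize locally before inspection.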