Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Yao-Ching Yu; Tsun-Han Chiang; Cheng-Wei Tsai; Chien-Ming Huang; Wen-Kwang Tsao

Abstract

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, open-source datasets for cybersecurity remain scarce, and high-quality cybersecurity pretraining corpora are especially lacking, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continued pretraining on our dataset yields a **15.9%** improvement in the aggregate score, while reasoning distillation leads to a **15.8%** gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community.

Anthology ID: 2025.emnlp-main.527
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 10402–10424
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.527/
Cite (ACL): Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, and Wen-Kwang Tsao. 2025. Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10402–10424, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training (Yu et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.527.pdf
Checklist: 2025.emnlp-main.527.checklist.pdf
