# Datasets

## TRAM

This dataset belongs to [CTID](https://mitre-engenuity.org/cybersecurity/center-for-threat-informed-defense/) and was originally released in this [GitHub repository](https://github.com/center-for-threat-informed-defense/tram).

We processed the original files (i.e., gathered data from all sources, removed duplicates, resolved noisy or too-short text and noisy labels, and remapped labels to MITRE ATT&CK v12.0) and split the result into training, dev, and test splits.
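The cleaning and splitting steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script we used: the function name `clean_and_split`, the `(text, label)` record layout, the `min_chars` threshold, the split ratios, and the seed are all hypothetical.

```python
import random


def clean_and_split(records, min_chars=20, ratios=(0.8, 0.1, 0.1), seed=0):
    """Deduplicate, drop too-short texts, and split into train/dev/test.

    `records` is an iterable of (text, label) pairs. The threshold and the
    ratios are illustrative assumptions, not the values used for this dataset.
    """
    seen = set()
    cleaned = []
    for text, label in records:
        # Collapse runs of whitespace so trivially different copies match.
        normalized = " ".join(text.split())
        key = normalized.lower()
        if len(normalized) < min_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append((normalized, label))

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(cleaned)

    n_train = int(len(cleaned) * ratios[0])
    n_dev = int(len(cleaned) * ratios[1])
    return {
        "train": cleaned[:n_train],
        "dev": cleaned[n_train:n_train + n_dev],
        "test": cleaned[n_train + n_dev:],
    }
```

Label remapping to ATT&CK v12.0 is not shown, since it depends on a mapping table that this sketch does not assume.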

## Procedure+

The dataset consists of two sub-datasets:
- Procedures: these belong to [MITRE](https://github.com/mitre/cti/tree/master). All procedure examples from v12.0 were gathered, processed (i.e., markup removed), and split into training, dev, and test splits.
- Derived procedures: we crawled the URL references for each procedure example and extracted the original text from the articles determined to be relevant to the procedure examples. The text was processed and split into training, dev, and test splits.
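As an illustration of the markup removal mentioned for the Procedures sub-dataset, here is a minimal sketch. It assumes the procedure descriptions contain Markdown-style links, `(Citation: ...)` markers, and `<code>` tags; the function name `strip_markup` and the regular expressions are hypothetical and may not cover every case handled in our processing.

```python
import re

# Markdown-style link: keep the visible text, drop the URL.
LINK = re.compile(r"\[([^\]]+)\]\((?:https?://[^)\s]+)\)")
# Citation marker such as "(Citation: Some Report)".
CITATION = re.compile(r"\s*\(Citation:[^)]*\)")
# Inline <code> and </code> tags: keep the enclosed text.
CODE_TAG = re.compile(r"</?code>")


def strip_markup(description):
    """Return the description as plain text with normalized whitespace."""
    text = LINK.sub(r"\1", description)
    text = CITATION.sub("", text)
    text = CODE_TAG.sub("", text)
    return " ".join(text.split())
```

For example, a link to a group page followed by a citation marker is reduced to the group name and the surrounding sentence.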

## Expert

We constructed this dataset ourselves (detailed in Appendix C) by collecting and selecting paragraphs from a large pool of high-quality threat reports. The rich textual paragraphs were annotated by seasoned security experts.

The dataset is also pre-split into training, dev, and test splits. As described in our paper, a subset of the dataset has significantly higher recall (>90%), and we use that subset as the final test split.

Therefore, the split named `expert_test_not_used_split` can be merged into either the `dev` or the `train` split. We did not use this split in our experiments (as it has no effect).
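If you do want to fold the unused split back in, a minimal sketch could look like the following. It assumes the splits have already been loaded into a dictionary of lists keyed by split name; the function name `merge_unused_split` and that in-memory layout are hypothetical, and only the split name `expert_test_not_used_split` comes from this dataset.

```python
def merge_unused_split(splits, target="train", unused="expert_test_not_used_split"):
    """Return a new dict in which the unused split is appended to `target`.

    `splits` maps split names to lists of examples (an assumed layout).
    The input dictionary and its lists are left unmodified.
    """
    if target not in ("train", "dev"):
        raise ValueError("target must be 'train' or 'dev'")
    # Copy every split except the unused one.
    merged = {name: list(rows) for name, rows in splits.items() if name != unused}
    # Append the unused examples to the chosen target split.
    merged[target] = merged.get(target, []) + list(splits.get(unused, []))
    return merged
```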