## External Data Used in SMART-KPE

#### Description

This folder contains the external data we crawled and used in our model in addition to the original [OpenKP](https://github.com/microsoft/OpenKP) dataset. We provide the original title and a snapshot of each website. The data are divided into train/test/validation sets according to the official split. The number in each file name corresponds to the line number of that example in the official dataset file.

* **Snapshot**: All snapshots are stored in .png format and resized to 256×256. For inaccessible websites, the snapshot shows a custom 404 page, a Google Chrome error page, or a blank page, depending on the error that actually occurred.
* **Title**: All titles are stored in .txt format. For inaccessible websites, the file contains `[NULL]` or is blank.

Due to the file size limit of the paper submission system, we provide only 20 examples for each split of our dataset.

#### How To Use

* For snapshots, the data should be preprocessed with a visual embedding network such as ResNet and saved to .npy files. Then set ```--meta_dir``` to the folder that contains the processed snapshots in the following structure:

  ```
  meta_dir
  └─ snapshot
       ├─ train_res
       |    ├─ 0.npy
       |    ├─ 1.npy
       |    :
       |    :
       |    └─ 134893.npy
       ├─ test_res
       |    ├─ 0.npy
       |    ├─ 1.npy
       |    :
       |    :
       |    └─ 6613.npy
       └─ dev_res
            ├─ 0.npy
            ├─ 1.npy
            :
            :
            └─ 6615.npy
  ```

   The example code we used to process the training set is included in the ```preprocessing code``` folder.

* For titles, the data should be merged into the official dataset. Add a `title` field containing the corresponding title to every dictionary in the official ```train/test/dev.jsonl``` files, then use the processed files as normal.
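
The title step can be sketched roughly as follows. This is a minimal illustration, not the script we used: the helper name `add_titles`, the `<index>.txt` naming of the title files, and the zero-based line index (chosen to match `0.npy` in the snapshot layout) are all assumptions, so adjust them to your local copy of the data.

```python
import json
from pathlib import Path


def add_titles(jsonl_in, title_dir, jsonl_out):
    """Copy jsonl_in to jsonl_out, adding a `title` field to every record.

    Hypothetical helper: assumes the title of the record on line i
    (zero-based) is stored in <title_dir>/<i>.txt.
    """
    title_dir = Path(title_dir)
    count = 0
    with open(jsonl_in, encoding="utf-8") as src, \
            open(jsonl_out, "w", encoding="utf-8") as dst:
        for index, line in enumerate(src):
            record = json.loads(line)
            title_file = title_dir / f"{index}.txt"  # assumed file naming
            if title_file.exists():
                # "[NULL]" and blank titles are kept as they are.
                title = title_file.read_text(encoding="utf-8").strip()
            else:
                # Assumption: a missing title file becomes an empty title.
                title = ""
            record["title"] = title
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `add_titles("train.jsonl", "title/train", "train_with_title.jsonl")` would write a new file with the extra field, where all three paths are placeholders for your own locations.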