-------------------------------------------------
-- DATA USED FOR SEMANTICALLY SMOOTH EMBEDDING --
-------------------------------------------------

------------------
OUTLINE:
1. Introduction
2. Content
3. Data Format
4. Data Statistics
5. How to Cite
6. Contact
------------------


------------------
1. INTRODUCTION:
------------------

These are the three data sets, LOCATION, SPORT, and NELL186, used for Semantically Smooth
Knowledge Graph Embedding. All three data sets are extracted from the Never-Ending
Language Learning (NELL) system (http://rtw.ml.cmu.edu/).


------------------
2. CONTENT:
------------------

The data archive contains 1 README file + 3 folders:
  - README: the specification document
  - Folder location: the LOCATION data set
  - Folder sport: the SPORT data set
  - Folder nell186: the NELL186 data set

Experiments on the LOCATION and SPORT data sets are repeated 5 times, each time with a
newly drawn training/validation/test split. The data for each split is stored in a
subfolder named fold_*.

Each folder/subfolder contains 6 files:
  - {dataset}_triples.train
  - {dataset}_triples.valid
  - {dataset}_triples.test
  - {dataset}_triples.neg.valid
  - {dataset}_triples.neg.test
  - {dataset}_catinfo
  
The 3 files {dataset}_triples.train/valid/test contain the observed triples
(training/validation/test sets). They are used in both link prediction and
triple classification.

The 2 files {dataset}_triples.neg.valid/test contain the negative triples
constructed for the positive triples in the validation/test sets. They are used
only in triple classification.
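
As a rough illustration of how the positive and negative files fit together in triple
classification, the sketch below labels observed triples 1 and negative triples 0 and
computes accuracy for a classifier. The function names and the `predict` callable are
hypothetical and not part of this archive; triples are assumed to be
(head, relation, tail) tuples already loaded into memory.

```python
def build_classification_set(positives, negatives):
    """Pair each triple with a label: 1 for observed triples, 0 for negative ones."""
    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]


def accuracy(labeled, predict):
    """Fraction of labeled triples for which predict(triple) returns the gold label.

    `predict` is a hypothetical callable mapping a triple to 0 or 1.
    """
    if not labeled:
        raise ValueError("no labeled triples given")
    correct = sum(1 for triple, label in labeled if predict(triple) == label)
    return correct / len(labeled)
```

A model would typically be tuned on the validation pair of files and then evaluated
once on the test pair.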

The file {dataset}_catinfo contains entities' category labels.
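
The layout described above can be checked with a short script. This is a minimal sketch,
not a tool shipped with the archive: the helper names are hypothetical, the {dataset}
prefix is matched by suffix only because its exact spelling is not assumed here, and a
folder without fold_* subfolders is treated as a single split.

```python
from pathlib import Path

# File name suffixes listed in this README; the {dataset} prefix is not assumed.
EXPECTED_SUFFIXES = (
    "_triples.train",
    "_triples.valid",
    "_triples.test",
    "_triples.neg.valid",
    "_triples.neg.test",
    "_catinfo",
)


def find_split_dirs(dataset_dir):
    """Return the directories that each hold one split.

    If fold_* subfolders exist, each is one split; otherwise the data set
    folder itself is treated as the only split (an assumption).
    """
    root = Path(dataset_dir)
    folds = sorted(p for p in root.glob("fold_*") if p.is_dir())
    return folds if folds else [root]


def missing_files(split_dir):
    """Return the expected suffixes for which no file exists in split_dir."""
    names = [p.name for p in Path(split_dir).iterdir() if p.is_file()]
    return [s for s in EXPECTED_SUFFIXES if not any(n.endswith(s) for n in names)]
```

Calling missing_files() on every directory returned by find_split_dirs() should yield
an empty list for a complete copy of the archive.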


------------------
3. DATA FORMAT:
------------------

The {dataset}_triples.* files contain one triple per line, stored in a tab ('\t')
separated format. The first element is the head entity, the second the relation,
and the third the tail entity.
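
A minimal reader for this format might look as follows. It is a sketch under a few
assumptions that this README does not state: the files are UTF-8 text, blank lines can
be skipped, and any line without exactly three tab-separated fields is an error. The
function name read_triples is hypothetical.

```python
def read_triples(path):
    """Read a {dataset}_triples.* file into a list of (head, relation, tail) tuples."""
    triples = []
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines, e.g. at the end of the file
            parts = line.split("\t")
            if len(parts) != 3:
                raise ValueError(
                    f"{path}:{line_no}: expected 3 tab-separated fields, got {len(parts)}"
                )
            head, relation, tail = parts
            triples.append((head, relation, tail))
    return triples
```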

The {dataset}_catinfo file contains one entity and its category label per line,
also stored in a tab ('\t') separated format. The first element is the entity,
the second is the relation 'generalizations' (newly introduced for this file),
and the third is the category label.
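
The category file can be loaded into a dictionary in much the same way. The sketch
below assumes UTF-8 text and that each entity appears on exactly one line, so a repeated
entity is reported as an error; both points are assumptions, and the function name
read_catinfo is hypothetical.

```python
def read_catinfo(path):
    """Read a {dataset}_catinfo file into a dict mapping entity -> category label."""
    categories = {}
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue
            parts = line.split("\t")
            if len(parts) != 3:
                raise ValueError(
                    f"{path}:{line_no}: expected 3 tab-separated fields, got {len(parts)}"
                )
            entity, relation, label = parts
            if relation != "generalizations":
                raise ValueError(f"{path}:{line_no}: unexpected relation {relation!r}")
            if entity in categories:
                # One label per entity is assumed here, not guaranteed by the README.
                raise ValueError(f"{path}:{line_no}: entity {entity!r} listed twice")
            categories[entity] = label
    return categories
```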


------------------
4. DATA STATISTICS:
------------------

The LOCATION data set consists of 380 entities and 8 relations among them.
There are 718 triples in total, split into the training/validation/test sets
with the ratio of 3:1:1.

The SPORT data set consists of 1,520 entities and 8 relations among them.
There are 3,826 triples in total, split into the training/validation/test sets
with the ratio of 3:1:1.
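
For orientation, the nominal split sizes implied by a 3:1:1 ratio can be computed as
below. Neither 718 nor 3,826 is divisible by 5, so the ratio is necessarily approximate
and the actual per-fold counts will differ slightly from these values; the exact counts
are not stated in this README. The helper name is hypothetical.

```python
def nominal_split_sizes(total, ratio=(3, 1, 1)):
    """Return the (train, valid, test) sizes implied by `ratio`, as floats."""
    parts = sum(ratio)
    return tuple(total * r / parts for r in ratio)


# Totals taken from the statistics above.
for name, total in (("LOCATION", 718), ("SPORT", 3826)):
    train, valid, test = nominal_split_sizes(total)
    print(f"{name}: train ~{train:.1f}, valid ~{valid:.1f}, test ~{test:.1f}")
```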

The NELL186 data set consists of 14,463 entities and 186 relations among them.
The training set contains 31,134 triples, the validation set 5,000 triples,
and the test set 5,000 triples.

All triples are unique, and we made sure that all entities/relations appearing in
the validation or test sets also occur in the training set.
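
These two properties can be verified on loaded data. The sketch below takes the three
sets as lists of (head, relation, tail) tuples and interprets "unique" as unique across
the three sets combined, which is an assumption; the function name check_splits is
hypothetical.

```python
def check_splits(train, valid, test):
    """Return (duplicate_count, unseen_entities, unseen_relations).

    duplicate_count counts repeated triples across the three sets combined.
    unseen_* hold entities/relations that occur in valid or test but not in train.
    """
    all_triples = list(train) + list(valid) + list(test)
    duplicate_count = len(all_triples) - len(set(all_triples))

    train_entities = {e for head, _, tail in train for e in (head, tail)}
    train_relations = {relation for _, relation, _ in train}

    unseen_entities = set()
    unseen_relations = set()
    for head, relation, tail in list(valid) + list(test):
        unseen_entities.update({head, tail} - train_entities)
        if relation not in train_relations:
            unseen_relations.add(relation)
    return duplicate_count, unseen_entities, unseen_relations
```

For data matching the description above, the expected result is a duplicate count of 0
and two empty sets.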


------------------
5. HOW TO CITE:
------------------

If you use this data, please cite the original paper:
  @inproceedings{guo2015:SSE,
    title     = {Semantically Smooth Knowledge Graph Embedding},
    author    = {Shu Guo and Quan Wang and Lihong Wang and Bin Wang and Li Guo},
    booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
    year      = {2015},
    note      = {to appear}
  }


------------------
6. CONTACT:
------------------

For remarks or questions, please contact Quan Wang:
wangquan (at) iie (dot) ac (dot) cn


