Variable Typing Annotated Dataset
---------------------------------

This is a pre-release version of the variable typing dataset that is to be attached to the NAACL HLT 2018 submission of the paper titled 

"Variable Typing: Assigning Meaning to Variables in Mathematical Text"

This version of the dataset is:
- NOT for redistribution.
- Likely to change, with contents added and/or removed.

Eventually, this data set will be released under the Open Data Commons License.

---------------------------------

The data set is distributed as three tab-delimited .dat files. The first line of each file is a header.
Each subsequent line is a record, and each record contains three fields in the following order:

seqid: The sequence number of the sentence in the source document

docid: The arXiv ID of the source document

annotation: The annotated sentence. The words in this sentence are whitespace-delimited. Edges can only occur between variables and words in the sentence. Edges have the format

<m:id>/index

Index can be negative (e.g., "-1") to indicate that the edge is negative, or 0 or greater to indicate a positive edge. A non-negative index forms an edge between the formula with id "id" and the word in the sentence with index number "index".
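
The edge format above could be read with a few lines of Python. This is only a minimal sketch under stated assumptions: it takes the "<m:id>/index" pattern literally as one whitespace-delimited token, and it assumes word indices count whitespace-delimited tokens starting from 0. Neither assumption is confirmed by this README, so check them against the data before relying on the result. The function name parse_edges and the sample sentence are invented for illustration and are not part of the data set.

```python
import re

# Assumed token shape, taken literally from the description: <m:id>/index
EDGE_RE = re.compile(r"^<m:([^>]+)>/(-?\d+)$")


def parse_edges(annotation):
    """Return (tokens, edges) for one annotated sentence.

    Each edge is a tuple (formula_id, index, target_word); target_word is
    None for a negative edge or an index outside the sentence.
    """
    tokens = annotation.split()
    edges = []
    for token in tokens:
        match = EDGE_RE.match(token)
        if match is None:
            continue  # ordinary word, carries no edge
        formula_id = match.group(1)
        index = int(match.group(2))
        # Assumption: indices are 0-based positions in the token list.
        target = tokens[index] if 0 <= index < len(tokens) else None
        edges.append((formula_id, index, target))
    return tokens, edges


# Invented sample sentence, not taken from the data set.
tokens, edges = parse_edges("Let <m:0>/4 be a group")
print(edges)
```

Returning the raw index alongside the resolved word keeps negative edges (index below 0) distinguishable from positive ones without a second pass.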


There are three files:

- developement.dat: The development (parameter tuning) data set of 841 sentences.
- evaluation.dat: The test set of 1689 sentences, intended for evaluation on unseen data.
- training.dat: The training set of 5273 sentences, used for training models.
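
As a rough illustration of the record layout (one header line, then the tab-delimited fields seqid, docid, annotation), the files could be loaded along the following lines. This is a sketch, not an official loader: the names Record and read_records are hypothetical, quoting is switched off on the assumption that sentences may contain literal quote characters, and rows with fewer than three fields are skipped.

```python
import csv
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    seqid: str       # sequence number of the sentence in the source document
    docid: str       # arXiv ID of the source document
    annotation: str  # annotated, whitespace-delimited sentence


def read_records(lines: Iterable[str]) -> List[Record]:
    """Parse tab-delimited lines; the first line is treated as the header."""
    # QUOTE_NONE: treat quote characters in sentences as ordinary text (assumption).
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    return [Record(*row[:3]) for row in reader if len(row) >= 3]


# Invented sample lines, not taken from the data set.
sample = [
    "seqid\tdocid\tannotation\n",
    "12\t1234.5678\tLet <m:0>/4 be a group\n",
]
for record in read_records(sample):
    print(record.seqid, record.docid, record.annotation)
```

An open file handle can be passed in place of the sample list, for example one obtained from open("training.dat", encoding="utf-8", newline=""); the UTF-8 encoding is an assumption.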
