This folder provides the code of our CNN-based model, together with example training, validation and test data.

Files:

Train_idx.txt, valid_idx.txt and test_idx.txt contain the feature indexes of each token in the training, validation and test sets. In these index files, each line corresponds to one token. Within a line, the first number is the index of the token's position feature and the last number is the token's label (1=A, 2=B, 0=O), while the remaining numbers are the indexes of the syntactic path. We also consider a fixed-size window of tokens around the current token, with window size w=3.
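The per-line layout described above can be sketched as follows. This is only an illustration under our assumption that the numbers on each line are whitespace-separated; the function name is ours, not part of the released code.

```python
# Hedged sketch of reading one line of train_idx.txt / valid_idx.txt /
# test_idx.txt, assuming whitespace-separated integers per token line.

def parse_token_line(line):
    """Split one index line into (position index, path indexes, label)."""
    nums = [int(x) for x in line.split()]
    position_idx = nums[0]   # first number: index of the position feature
    label = nums[-1]         # last number: token label (1=A, 2=B, 0=O)
    path_idxs = nums[1:-1]   # remaining numbers: syntactic-path indexes
    return position_idx, path_idxs, label

# Example: parse_token_line("3 17 42 8 1") -> (3, [17, 42, 8], 1)
```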

word_embed.txt contains the embeddings of the tokens appearing in syntactic paths. Both the constituency path and the dependency path between the cue and the token can be regarded as special sentences, whose "words" may be sentence tokens, syntactic categories, dependency relations, and arrows, as described in our paper.
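A minimal loader for such an embedding file might look like the sketch below, assuming each line of word_embed.txt holds a path token followed by its embedding values, all whitespace-separated (the exact file format is our assumption, and the function name is illustrative).

```python
# Hedged sketch of loading word_embed.txt, assuming the format
# "<token> <v1> <v2> ... <vn>" on each line.

def load_embeddings(path):
    """Return (vocab, vectors): token -> row index, and the embedding rows."""
    vocab, vectors = {}, []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            parts = line.rstrip().split()
            vocab[parts[0]] = i                       # map token to row
            vectors.append([float(v) for v in parts[1:]])  # its embedding
    return vocab, vectors
```

Path "words" here include syntactic categories (e.g. NP) and arrows, so the vocabulary is not limited to surface tokens.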

Codes:

First, run mycnn\data.py to produce two files: mydata.pkl and emb.pkl. The file mydata.pkl contains the indexes of the training, validation and test sets. The other, emb.pkl, contains the embeddings (from word_embed.txt) of the tokens in syntactic paths. We then run the gzip command (on Linux) to compress mydata.pkl and emb.pkl into mydata.pkl.gz and emb.pkl.gz.
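The pickle-then-gzip round trip can be done entirely from Python with the standard library, which is equivalent to running the external gzip command. This is a generic sketch, not code from mycnn\data.py.

```python
import gzip
import pickle

# Sketch of the save/compress step: pickle an object and write it
# gzip-compressed, producing files like mydata.pkl.gz / emb.pkl.gz.
def save_gzipped_pickle(obj, path):
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)

def load_gzipped_pickle(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

Reading the .gz files back with gzip.open avoids a separate decompression step at training time.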

To assign labels to the tokens in the test set, run mycnn\cnn.py. Note that this code is offered only as an example. When we perform 10-fold cross-validation on the Abstracts sub-corpus, there are only training and test sets, and the validation sets are merged into the training sets. When we perform the cross-domain evaluation on Clinical Records and Full Papers with a model trained on Abstracts, we predict the labels of the test set using the parameters of our CNN-based model (loaded with the function load_params() in mycnn\cnn.py) that achieve the best performance on the validation set after running mycnn\cnn.py several times.
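The model-selection logic above (run training several times, keep the parameters that score best on the validation set, then reuse them for the test set) can be sketched generically as follows; the function and the score values are hypothetical stand-ins, not part of mycnn\cnn.py.

```python
# Hedged sketch of best-of-several-runs selection: each run yields a set of
# saved parameters and a validation score; we keep the best-scoring run's
# parameters, which load_params() would then reload for test-set prediction.

def select_best_params(runs):
    """runs: list of (params, validation_score); return the best params."""
    best_params, _ = max(runs, key=lambda r: r[1])
    return best_params

# Example with illustrative scores:
# select_best_params([("run1.npz", 0.81), ("run2.npz", 0.86),
#                     ("run3.npz", 0.84)]) -> "run2.npz"
```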
