Code:

The code uses NLTK version 3.0.0.

The Hindi-Urdu Treebank (HUTB) corpus is in CoNLL format, which NLTK takes as input. The public version of this corpus is available here: https://verbs.colorado.edu/hindiurdu/
The variant-generation-code.py script outputs linear sentences whose pre-verbal constituents have been re-ordered.
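As a rough illustration of what pre-verbal re-ordering means, the sketch below permutes the constituents that precede the verb and leaves the verb in place. It is not the code in variant-generation-code.py: the function name preverbal_variants and the placeholder tokens are assumptions made for illustration, and it uses only the Python standard library rather than NLTK.

```python
from itertools import permutations


def preverbal_variants(preverbal, rest):
    """Return every ordering of the pre-verbal constituents, followed by `rest`.

    `preverbal` is a list of constituents (each a list of tokens) that precede
    the verb; `rest` holds the verb and anything after it. The first sentence
    returned keeps the original order.
    """
    variants = []
    for order in permutations(preverbal):
        tokens = [tok for constituent in order for tok in constituent]
        variants.append(" ".join(tokens + rest))
    return variants


# Placeholder tokens, not corpus data.
sentences = preverbal_variants([["SUBJ"], ["OBJ"]], ["VERB"])
```

Here sentences[0] is the original order and the remaining entries are the re-ordered variants.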


Dataset:

The dataset accompanying this README file can be used to replicate the main findings of our paper.

The dataset consists of Joachims-transformed data (the input to our statistical analysis scripts, written in R) computed from sentences
belonging to the Hindi-Urdu Treebank (HUTB) corpus (Bhatt et al., 2009), a standard dataset used in natural language processing. The transformation
follows the Joachims (2002) technique described in the paper.
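For readers unfamiliar with the Joachims (2002) transformation, the sketch below shows the general idea: a ranking problem over (reference, variant) pairs becomes a binary classification problem over feature differences. The function name joachims_transform and the toy feature values are assumptions for illustration; whether the released file keeps one or both difference rows per pair is not stated here, so consult the paper for the exact procedure.

```python
def joachims_transform(reference, variant):
    """Turn one (reference, variant) feature pair into two difference rows.

    Each row is (label, feature_differences): label 1 for reference minus
    variant, label 0 for variant minus reference.
    """
    ref_minus_var = {name: reference[name] - variant[name] for name in reference}
    var_minus_ref = {name: -value for name, value in ref_minus_var.items()}
    return [(1, ref_minus_var), (0, var_minus_ref)]


# Toy feature values, not taken from the dataset.
rows = joachims_transform({"dl": 10.0, "lm": 42.5}, {"dl": 14.0, "lm": 40.0})
```

A classifier fitted to such rows learns which feature differences favour the reference order over the variant.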

Actual sentences from these licensed corpora are not part of the dataset, but can be recovered using the IDs provided in the *.txt files.

The hutb-data-replicate.csv file contains the following fields in comma-separated format. Their definitions are elaborated below.
#--------------------------------------------------------------------#

"choice",
"SentID",
"DO",
"IO",
"Given_New",
"Root",
"Lemma",
"Ordering_Ref",
"dl",
"lm",
"lmlk",
"IS_Score",
"nulm",
"Prev1adnulm",
"Prev5adnulm",
"Prev1lm"

#--------------------------------------------------------------------#
"choice": Binary dependent variable encoding the choice (1: reference preferred; 0: variant preferred)

"SentID": Sentence ID in the form file number.sentence number.variant number (the sentence number is counted within the given file; variant number 0 indicates the reference sentence)

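A minimal sketch of splitting a SentID into its parts, assuming the three components are separated by dots as the description suggests. The function names and the example ID are hypothetical; check them against the actual values in the file.

```python
def parse_sent_id(sent_id):
    """Split 'file.sentence.variant' into (file, sentence, variant number)."""
    # rsplit from the right so that only the last two dots are treated as separators.
    file_no, sentence_no, variant_no = sent_id.rsplit(".", 2)
    return file_no, sentence_no, int(variant_no)


def is_reference(sent_id):
    """Variant number 0 marks the reference sentence."""
    return parse_sent_id(sent_id)[2] == 0


# Hypothetical ID, not taken from the dataset.
parts = parse_sent_id("12.3.0")
```
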
"DO": Direct object (DO) is fronted in the reference sentence, while the variant has the canonical order of subject preceding DO
        
        k2k1: reference sentence has k2k1 annotation, while the variant has k1k2 annotation.
        none: sentence which does not encode DO-fronted non-canonical ordering


"IO": Indirect object (IO) is fronted in the reference sentence, while the variant has the canonical order of subject preceding IO
        
        k4k1: reference sentence has k4k1 annotation, while the variant has k1k4 annotation.
        none: sentence which does not encode IO-fronted non-canonical ordering


"Given_New": Sentences annotated with given-new ordering based on the annotation scheme described in the paper

"Root": The main verb (lexical item) of the sentence which acts as root of the associated dependency tree

"Lemma": The lemma form of the main verb of the sentence which acts as root of the associated dependency tree

"Ordering_Ref": Encodes the ordering of arguments in the sentence (k1: subject, k4: indirect object, k2: direct object).
	For example: k1_k2 denotes subject is followed by the direct object.
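The labels in this field can be expanded into readable role names with a small helper such as the one below. It assumes the labels are joined by underscores, as in the k1_k2 example; the helper name describe_ordering is hypothetical.

```python
ROLE_NAMES = {"k1": "subject", "k2": "direct object", "k4": "indirect object"}


def describe_ordering(ordering_ref):
    """Map an ordering such as 'k1_k2' to role names, left to right."""
    # Unknown labels are passed through unchanged rather than guessed at.
    return [ROLE_NAMES.get(label, label) for label in ordering_ref.split("_")]


roles = describe_ordering("k1_k2")
```
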


"dl": dependency length of the sentence

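For orientation, one common way to compute dependency length is to sum the linear distances between each word and its head. The sketch below follows that definition, which is an assumption here; the paper may count intervening words or use a different convention. The function name and the toy head list are illustrative only.

```python
def total_dependency_length(heads):
    """Sum |head position - word position| over all non-root words.

    heads[i] is the 1-based position of the head of word i + 1;
    a head of 0 marks the root, which contributes nothing.
    """
    return sum(
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0
    )


# Toy three-word sentence whose last word is the root and heads the other two.
length = total_dependency_length([3, 3, 0])
```
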
"lm": trigram surprisal of the sentence

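Sentence-level surprisal is typically the sum of per-word surprisals, each being the negative log probability of a word given its context (the two preceding words for a trigram model). The sketch below assumes base-2 logarithms and summation over words; both are assumptions, and the probabilities shown are toy values rather than output of the trained model.

```python
import math


def sentence_surprisal(conditional_probs):
    """Sum of -log2 P(word | context) over the words of a sentence, in bits."""
    return sum(-math.log2(p) for p in conditional_probs)


# Toy conditional probabilities for a two-word sentence.
bits = sentence_surprisal([0.5, 0.25])
```
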
"lmlk": PCFG syntactic surprisal

"IS_Score": Information structure metric based on the scheme described in the paper

"nulm": LSTM surprisal of the sentence

"Prev1adnulm": Adaptive LSTM surprisal when the base LSTM or vanilla LSTM was adapted to only 1 preceding sentence in the discourse

"Prev5adnulm": Adaptive LSTM surprisal when the base LSTM or vanilla LSTM was adapted to only 5 preceding sentences in the discourse

"Prev1lm": Lexical repetition surprisal obtained by interpolating the vanilla trigram language model ("lm") to one preceding sentence in the discourse.

#--------------------------------------------------------------------#
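With the field definitions above in hand, the file can be loaded and sanity-checked using the Python standard library. The sketch assumes the first row of hutb-data-replicate.csv is a header row carrying exactly these field names; the function name read_rows is hypothetical, and the in-memory header-only input stands in for the real file.

```python
import csv
import io

EXPECTED_FIELDS = [
    "choice", "SentID", "DO", "IO", "Given_New", "Root", "Lemma",
    "Ordering_Ref", "dl", "lm", "lmlk", "IS_Score", "nulm",
    "Prev1adnulm", "Prev5adnulm", "Prev1lm",
]


def read_rows(handle):
    """Read all rows as dictionaries, failing early if a column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


# Header-only stand-in; for the real data use
# open("hutb-data-replicate.csv", newline="") instead.
rows = read_rows(io.StringIO(",".join(EXPECTED_FIELDS) + "\n"))
```

All values are returned as strings, so numeric fields such as "dl" or "lm" need an explicit float() conversion before analysis.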
