Prediction of Learning Curves in Machine Translation
-----------------------------------------------------
This folder contains supplementary materials for the work presented in
the following paper:
    Prediction of Learning Curves in Machine Translation
    Prasanth Kolachina, Nicola Cancedda, Marc Dymetman and Sriram Venkatapathy

We make the following data available for the sake of reproducibility and verifiability. 
    1.  BLEU scores obtained at different training corpus sizes
    over a range of configurations (language pair and corpus domain).
    These values are used in our curve-fitting experiments (Section 3 of the paper).

    2.  The values that the gold curve (Pow_3) estimates at the three anchor sizes,
    which are the object of prediction throughout the rest of the paper.
    The parameter values of the gold curve can be reproduced by fitting
    the Pow_3 curve family to these three points.

    3.  The feature values used to predict the BLEU scores at the three anchor points for all 96 learning curves.

    4.  The predictions made at the anchor points using
        a)  Ridge regression model
        b)  Lasso regression model
        c)  Constant mean baseline model
    from features collected over the monolingual corpora of the source and
    target languages, together with the corresponding evaluation of the
    predictions made at the anchor points by each of these models.
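
As an illustration of item 2, the sketch below recovers the three parameters
of a power-law curve from exactly three points. It assumes that Pow_3 has the
form y = c - a * x^(-alpha); check this against the definition in the paper
before relying on it. The function names and the sizes used in the example
are hypothetical and are not taken from the released files. This is not the
fitting procedure used in the paper, only a closed-form check for the
three-point case.

```python
import math


def pow3(x, c, a, alpha):
    """Assumed Pow_3 form: y = c - a * x**(-alpha)."""
    return c - a * x ** (-alpha)


def fit_pow3_through_three_points(points, lo=1e-6, hi=10.0, iters=200):
    """Return (c, a, alpha) such that pow3 passes through three (x, y) points.

    Assumes x1 < x2 < x3 and strictly increasing y values, and that the
    exponent alpha lies in [lo, hi] (a hypothetical search range).
    """
    (x1, y1), (x2, y2), (x3, y3) = sorted(points)
    if not (y1 < y2 < y3):
        raise ValueError("scores must increase strictly with corpus size")

    # Eliminating c and a leaves one equation in alpha:
    #   (x1^-alpha - x2^-alpha) / (x2^-alpha - x3^-alpha) = (y2 - y1) / (y3 - y2)
    # Written with p = ln(x2/x1) and q = ln(x3/x2), the left-hand side is
    #   (exp(p*alpha) - 1) / (1 - exp(-q*alpha)),
    # which increases monotonically in alpha, so bisection applies.
    r = (y2 - y1) / (y3 - y2)
    p = math.log(x2 / x1)
    q = math.log(x3 / x2)

    def g(alpha):
        return math.expm1(p * alpha) / -math.expm1(-q * alpha) - r

    if g(lo) > 0 or g(hi) < 0:
        raise ValueError("no exponent in the search range fits these points")

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)

    # With alpha fixed, a and c follow from two linear equations.
    a = (y2 - y1) / (x1 ** -alpha - x2 ** -alpha)
    c = y1 + a * x1 ** -alpha
    return c, a, alpha


if __name__ == "__main__":
    # Hypothetical anchor sizes and parameters, for demonstration only.
    sizes = (10000, 75000, 500000)
    pts = [(x, pow3(x, 30.0, 100.0, 0.3)) for x in sizes]
    print(fit_pow3_through_three_points(pts))
```

With three points and three parameters the curve is determined exactly, so no
least-squares optimisation is needed in this special case.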

The directory structure is as follows: 
    1. The points used in the experiments on Curve Fitting are available in
    directory 'expt-data/curve-fitting/points/bleu'. There are 30 files,
    each file corresponding to a single configuration reported in the paper. 

    2. The experimental settings used to train the phrase-based statistical
    machine translation models using Moses can be found in the file
    'expt-data/curve-fitting/README_TranslationModels'. 

    3. The values that the gold curve (Pow_3) estimates at the three anchor
    sizes are available in the following location:
    'expt-data/curve-fitting/evaluation/pow-values.txt'. 

    4. The feature values are available in the 'expt-data/feature-extraction' directory. 

    5. The predictions made at the three anchor sizes are found in
    'expt-data/inference/*/cumul-predictions.txt' and the corresponding
    evaluation in 'expt-data/inference/*/evaluation.txt', where * is baseline
    (the constant mean baseline predictor), ridge (L2-regularized linear
    model) or lasso (L1-regularized linear model).
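
The layout described above can be checked after unpacking with a short script
such as the following sketch. The paths are the ones listed in this README;
the root argument (the folder into which the materials were unpacked) and the
function names are assumptions made for illustration.

```python
from pathlib import Path

# Sub-directory names under 'expt-data/inference', as listed in this README.
MODELS = ("baseline", "ridge", "lasso")


def expected_paths(root):
    """List the files and directories this README says should exist."""
    root = Path(root)
    paths = [
        root / "expt-data/curve-fitting/points/bleu",
        root / "expt-data/curve-fitting/README_TranslationModels",
        root / "expt-data/curve-fitting/evaluation/pow-values.txt",
        root / "expt-data/feature-extraction",
    ]
    for model in MODELS:
        paths.append(root / "expt-data/inference" / model / "cumul-predictions.txt")
        paths.append(root / "expt-data/inference" / model / "evaluation.txt")
    return paths


def missing_paths(root):
    """Return the expected paths that are not present under root."""
    return [p for p in expected_paths(root) if not p.exists()]


if __name__ == "__main__":
    # "." is a hypothetical root; pass the actual unpacking location instead.
    for path in missing_paths("."):
        print("missing:", path)
```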

The format of each of the files is explained in the README document
available in the corresponding directory.


In case of any further questions, contact:
  Prasanth Kolachina at prasanth_k@research.iiit.ac.in.
