This directory contains all the files needed to replicate the results in the paper:

"Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction"

In Proceedings of ACL 2011.

There are four directories:

crf = the data used to train the post-processing morpheme prediction model with crfsgd

moses_baseline = the word-based baseline data

moses_segmented = the data with the Finnish side segmented using unsupervised morpheme segmentation

moses_seg_peeled = the segmented data used as input to our post-processing morpheme prediction model

You can use any CRF software that reads the CRF++ data file format to train the post-processing
morpheme prediction model. We used crfsgd by Leon Bottou. We trained with the default
parameters, which are shown in parentheses below for future reference.
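
As a rough illustration of the CRF++ data file format: each line holds one token with
whitespace-separated columns, the label is in the last column, and a blank line separates
sequences. The sketch below writes a toy file and checks that every non-blank line has the
same number of columns. The tokens, features, and labels are placeholders and are not taken
from the files in the crf directory.

```shell
#!/bin/sh
# Toy file in CRF++ column format: one token per line, whitespace-separated
# columns, label in the last column, blank line between sequences.
# The tokens, features, and labels are placeholders, not the paper's data.
TOY=$(mktemp)
printf '%s\n' 'tok1 featA LAB1' 'tok2 featB LAB2' '' 'tok3 featA LAB1' > "$TOY"

# Report the column count, or the first line whose column count differs.
CHECK=$(awk '
    NF == 0    { next }                 # blank line = sequence boundary
    cols == 0  { cols = NF }            # first token line fixes the column count
    NF != cols { printf "line %d has %d columns, expected %d\n", NR, NF, cols
                 bad = 1; exit }
    END        { if (!bad) printf "ok: %d columns\n", cols }
' "$TOY")
echo "$CHECK"
rm -f "$TOY"
```

The same awk check can be pointed at a real training file before handing it to the CRF
software.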

To run crfsgd (available from http://leon.bottou.org/projects/sgd):

Usage (training): crfsgd [options] model template traindata [devdata]
Usage (tagging):  crfsgd -t model testdata
Options for training:
 -c <num> : capacity control parameter (1.0)
 -f <num> : threshold on the occurrences of each feature (3)
 -r <num> : total number of epochs (50)
 -h <num> : epochs between each testing phase (10)
 -e <cmd> : performance evaluation command (conlleval -q)
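
A minimal sketch of a training and tagging run following the usage lines above, with the
default values written out explicitly. All file names (model.gz, template, train.txt,
dev.txt, test.txt, tagged.txt) are hypothetical; substitute the files from the crf
directory. The script only prints the commands unless crfsgd is on the PATH and the
training file exists.

```shell
#!/bin/sh
# Hypothetical file names; replace with the files shipped in the crf directory.
MODEL=model.gz
TEMPLATE=template
TRAIN=train.txt
DEV=dev.txt
TEST=test.txt

# Training command with the listed defaults spelled out (-e is left at its default).
TRAIN_CMD="crfsgd -c 1.0 -f 3 -r 50 -h 10 $MODEL $TEMPLATE $TRAIN $DEV"
# Tagging command using the trained model.
TAG_CMD="crfsgd -t $MODEL $TEST"

if command -v crfsgd >/dev/null 2>&1 && [ -f "$TRAIN" ]; then
    # Train, then tag the test data and save the output.
    $TRAIN_CMD && $TAG_CMD > tagged.txt
else
    # Dry run: show what would be executed.
    echo "$TRAIN_CMD"
    echo "$TAG_CMD > tagged.txt"
fi
```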

To replicate the experiments in our paper with the data files above, run Moses using the
steps described below. We do not provide the changes needed to tune Moses against a
word-based reference, since tuning against the segmented reference gave us better results.

To run Moses (available from http://www.statmt.org/moses/):


train-model.perl -root-dir root_dir --corpus training_corpus -f en -e fi -lm language_model

mert-moses.pl --mertdir=PATH-TO-MERT-DIR input_text references decoder_executable decoder.ini

reuse-weights.perl tuning_dir/moses.ini < model/moses.ini > moses.weight-reused.ini

filter-model-given-input.pl filter_dir moses.weight-reused.ini corpus/eval.input 

moses -f filter_dir/moses.ini -mbr -drop-unknown < corpus/eval.input > corpus/eval.out
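
The five steps above can be chained in one script that stops at the first failure. This is
only a sketch: the step helper and the DRY_RUN variable are our own additions for
illustration, and all paths are the same placeholders used in the commands above. By
default the script prints each command instead of running it; set DRY_RUN=0 once the
placeholders are replaced with real paths.

```shell
#!/bin/sh
set -e                      # stop at the first failing step
DRY_RUN=${DRY_RUN:-1}       # 1 = print commands only (default), 0 = execute
STEPS=0

# step: hypothetical helper that prints or runs one pipeline command.
step() {
    STEPS=$((STEPS + 1))
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $1"
    else
        eval "$1"
    fi
}

# 1. Train the translation model (English to Finnish).
step "train-model.perl -root-dir root_dir --corpus training_corpus -f en -e fi -lm language_model"
# 2. Tune the weights with MERT.
step "mert-moses.pl --mertdir=PATH-TO-MERT-DIR input_text references decoder_executable decoder.ini"
# 3. Copy the tuned weights into the trained model's configuration.
step "reuse-weights.perl tuning_dir/moses.ini < model/moses.ini > moses.weight-reused.ini"
# 4. Filter the model down to the evaluation input.
step "filter-model-given-input.pl filter_dir moses.weight-reused.ini corpus/eval.input"
# 5. Decode the evaluation input.
step "moses -f filter_dir/moses.ini -mbr -drop-unknown < corpus/eval.input > corpus/eval.out"
```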

