-------------------------------------------------
- Domain Adaptation Experiment code for         -
-  "Pointwise Prediction for Robust, Adaptable  -
-    Japanese Morphological Analysis"           -
-------------------------------------------------

This code can be used to reproduce our experiments for the paper
"Pointwise Prediction for Robust, Adaptable Japanese Morphological
Analysis." In order to do so, please perform the following steps.

1) Make a directory "data", and place the training and testing data
inside of this directory. The necessary data sets are:
  
  * gen-01-09.wordpart: General domain training data
  * gen-10.wordpart: General domain testing data
  * gen-01-08.wordpart: 8/9 of the general domain data for training
      models to test parameters on held-out data
  * gen-09.wordpart: 1/9 of the general domain training data held out
      for testing
  * tar-01-09.wordpart: Target domain training data
  * tar-10.wordpart: Target domain testing data

  All data must be in the format of
    宝石/名詞 を/助詞 磨/動詞 く/語尾
    (word/POS word/POS word/POS word/POS)
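
  For reference, a line in this format could be parsed with a sketch
  like the following (plain Python for illustration; not part of this
  distribution, which uses Perl scripts):

```python
# Parse one line of "word/POS word/POS ..." data into (word, POS)
# pairs. rsplit is used so that a surface form containing "/" keeps
# everything before the last slash as the word.
def parse_line(line):
    pairs = []
    for token in line.strip().split():
        word, pos = token.rsplit("/", 1)
        pairs.append((word, pos))
    return pairs

print(parse_line("宝石/名詞 を/助詞 磨/動詞 く/語尾"))
```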
  
  Unfortunately, we do not have permission to distribute the Balanced
  Corpus of Contemporary Japanese (BCCWJ), which was used in these
  experiments, but we will be happy to provide our pre-processing scripts
  or the experimental data to those who can show that they have 
  permission to use the corpus.
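
  The 8/9 vs. 1/9 held-out split described above could be produced with
  a sketch along the following lines (an assumption for illustration:
  the original corpus was divided into ten numbered sections, so the
  actual split may follow section boundaries rather than raw sentence
  counts):

```python
def split_heldout(sentences, num=8, denom=9):
    """Split a list of sentences into the first num/denom for training
    and the remaining portion held out for hyperparameter tuning."""
    cut = len(sentences) * num // denom
    return sentences[:cut], sentences[cut:]

# Applied to the corpus files, e.g.:
#   lines = open("data/gen-01-09.wordpart", encoding="utf-8").readlines()
#   train, held = split_heldout(lines)
#   ... write train to data/gen-01-08.wordpart, held to data/gen-09.wordpart
```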

2) Install the necessary software for learning the models:
    MeCab:     http://sourceforge.net/projects/mecab/files/
    LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
    CRFSuite:  https://github.com/chokkan/crfsuite/

    Adjust the variables at the beginning of build-model.pl and analyze.pl
    so that they point to the appropriate locations. CRFSuite and MeCab should
    be OK with no adjustment if they are installed with the default
    settings, but LIBLINEAR will likely require adjustment. In addition,
    the software has the ability to learn models using CRF++, Classias,
    and KyTea, but these achieved inferior results or took too long to
    train, so they are omitted from the paper.

3) Make the directory "exp" which will hold the experimental results.

4) Run experiments using ./process.pl. process.pl takes several arguments
     that determine which type of model to run.

   ./process.pl PROGRAM SOLVER TYPE CRITERION

    PROGRAM: The program to be used in training
                  (liblinear/crfsuite/mecab/crfpp/classias/kytea)
    SOLVER: The type of solver to use (varies by program)
                  (lrprimal/lrdual/lbfgs/mira/sgd)
    TYPE: The type of active learning to use
            part -> Perform partial annotation
            full -> Perform full annotation, starting at the beginning
                    of the corpus
            sent -> Perform full annotation, selecting sentences using
                    active learning
            dict -> Perform adaptation by adding words to the dictionary,
                    selecting words with the lowest probability
    CRITERION: The criterion to use when picking spots to annotate
            margin -> The probability margin P(y_1)-P(y_2), where y_1 is the
                      most probable candidate, and y_2 is the second most
                      probable candidate, use with "part" or "dict"
            tot -> Pick sentences using the total posterior probability of
                   the Viterbi analysis, use with "sent"
            avg -> Pick sentences using the average marginal probability of
                   each word or word boundary, use with "sent"
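
   As an illustration (not code from this distribution), the three
   criteria could be computed from per-candidate or per-word
   probabilities roughly as follows; the real scripts obtain these
   probabilities from the trained models, and spots or sentences with
   the lowest scores are annotated first:

```python
import math

def margin(cand_probs):
    """P(y_1) - P(y_2): gap between the two most probable candidates
    at one annotation spot; a smaller gap means less confidence."""
    top = sorted(cand_probs, reverse=True)
    return top[0] - top[1]

def tot(word_probs):
    """Total posterior probability of the Viterbi analysis, sketched
    here as a product of per-word probabilities taken in log space."""
    return math.exp(sum(math.log(p) for p in word_probs))

def avg(word_probs):
    """Average marginal probability over the words (or word
    boundaries) in a sentence."""
    return sum(word_probs) / len(word_probs)
```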

  The actual values used in the reported results are

  ./process.pl liblinear lrdual part margin
  ./process.pl liblinear lrdual sent avg
  ./process.pl crfsuite lbfgs sent avg
  ./process.pl mecab lbfgs sent avg

  although most other possible combinations were tested and found to 
  achieve inferior results.

  Note that process.pl first tunes the hyperparameter of the learning
  algorithm on the held-out data, then performs an active learning experiment
  on the full data set.

5) Tabulate the results. This can be done by running

  grep F-meas exp/*/*/tar-10.grade | parsegrade.pl > results.csv

  and viewing results.csv in any spreadsheet viewer.
