CREATION OF GER_TIGER_LC.GR

DATA

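Commands used to create tiger_release_july03_train.mrg (training data) and tiger_release_july03_test.mrg (test data). Lines 293586 to 368957 of the converted corpus file are held out for testing; all remaining lines are used for training: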

iconv -f iso-8859-1 -t utf-8 ~/tiger/corpus/tiger_release_july03.penn > tiger_release_july03.penn
sed -n '1,293585 p' tiger_release_july03.penn > tiger_release_july03_train1.penn
sed -n '293586,368957 p' tiger_release_july03.penn > tiger_release_july03_test.penn
sed -n '368958,$ p' tiger_release_july03.penn > tiger_release_july03_train2.penn
cat tiger_release_july03_train1.penn tiger_release_july03_train2.penn > tiger_release_july03_train.penn
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_train.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' | sed 's/./\L\0/g' > tiger_release_july03_train.mrg
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_test.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' | sed 's/./\L\0/g' > tiger_release_july03_test.mrg
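To make the effect of the normalization pipeline above easier to see, the sketch below runs the same sed/grep chain on a tiny made-up tree. The sample text is hypothetical (it only imitates the layout that the patterns imply and is not taken from the corpus), and the \L case conversion assumes GNU sed.

```shell
# Hypothetical input: a comment line, a tree wrapped in lone "(" and ")" lines,
# and a trailing blank line. None of this is real corpus text.
sample='%% hypothetical comment line
(
(S (NP-SB (ART-NK Die) (NN-NK Tagung)) (VVFIN-HD beginnt) ($( -))
)
'

normalized=$(printf '%s\n' "$sample" |
  sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' |  # CAT-FUNC becomes CAT*FUNC
  sed 's/($(/($LRB/g' |                          # rename the "$(" tag to "$LRB"
  sed 's/^($/((/g' |                             # a lone "(" line becomes "(("
  sed 's/^)$/))/g' |                             # a lone ")" line becomes "))"
  grep -v '^$' |                                 # drop empty lines
  grep -v '^%' |                                 # drop lines starting with "%"
  sed 's/./\L\0/g')                              # lowercase everything (GNU sed)

printf '%s\n' "$normalized"
```

With GNU sed this should print three lines: "((", then "(s (np*sb (art*nk die) (nn*nk tagung)) (vvfin*hd beginnt) ($lrb -))", then "))".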

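The three sed -n ranges in the commands above cut a contiguous block of 75372 lines (293586 to 368957) out of the middle of the file for testing. A minimal sketch of the same split on a 10-line stand-in file, with made-up boundaries and hypothetical file names, checks that every line lands in exactly one of the two sets:

```shell
# Hypothetical stand-in for the corpus: 10 numbered lines in a temp directory.
tmp=$(mktemp -d)
seq 1 10 > "$tmp/all.txt"

# Same pattern as the real split: head part, held-out middle, tail part.
sed -n '1,3 p' "$tmp/all.txt" > "$tmp/train1.txt"
sed -n '4,6 p' "$tmp/all.txt" > "$tmp/test.txt"
sed -n '7,$ p' "$tmp/all.txt" > "$tmp/train2.txt"
cat "$tmp/train1.txt" "$tmp/train2.txt" > "$tmp/train.txt"

# Line counts; $(( )) strips any padding that wc may add.
total=$(( $(wc -l < "$tmp/all.txt") ))
train=$(( $(wc -l < "$tmp/train.txt") ))
test_n=$(( $(wc -l < "$tmp/test.txt") ))
echo "total=$total train=$train test=$test_n"

# Train and test together must reproduce the original file exactly.
if sort -n "$tmp/train.txt" "$tmp/test.txt" | cmp -s - "$tmp/all.txt"; then
  partition=ok
else
  partition=broken
fi
echo "partition=$partition"
```

The same wc -l comparison can be run on the real .penn files to confirm that the train and test line counts add up to the full corpus.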

TRAINING

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path tiger_release_july03_train.mrg -out ger_tiger.gr -treebank SINGLEFILE

Training was killed after 5 rounds had completed, because the grammar had started to overfit (see the test results below).


TESTING

Each of the intermediate files ger_tiger.gr_N_smoothing.gr (for N=1,...,5) was evaluated on the test set to check for overfitting.

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_N_smoothing.gr
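Running the command above once per intermediate grammar can be scripted. The sketch below is a dry run: it only prints one command line per N and does not start Java. Piping through tail and head to pick out the second-to-last output line is an assumption based on the note in the results section, not part of the original procedure.

```shell
# Dry run: build (but do not execute) one GrammarTester command per round N.
commands=$(for n in 1 2 3 4 5; do
  printf '%s\n' "java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_${n}_smoothing.gr | tail -n 2 | head -n 1"
done)

printf '%s\n' "$commands"
```

To actually run the evaluations, the printed lines can be pasted into a shell in the directory that holds BerkeleyParser.jar, the test file, and the intermediate grammars.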

Results (second-to-last line of the testing output for each N):

N=1: [Average]  P: 65.43 R: 67.23 F1: 66.32 EX: 21.99
N=2: [Average]  P: 68.42 R: 70.7 F1: 69.54 EX: 24.15
N=3: [Average]  P: 71.09 R: 73.16 F1: 72.11 EX: 26.84
N=4: [Average]  P: 70.61 R: 73.19 F1: 71.88 EX: 27.1
N=5: [Average]  P: 69.79 R: 72.46 F1: 71.1 EX: 26.52

ger_tiger.gr_3_smoothing.gr (N=3, the highest F1 above) was used as the final ger_tiger_lc.gr
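The choice of N=3 can be checked mechanically against the result lines listed above. The sketch below feeds those five lines to awk and reports the round with the highest F1; selecting by F1 is an assumption about the criterion, and the variable names are illustrative.

```shell
# The five [Average] lines, copied verbatim from the results above.
results='N=1: [Average]  P: 65.43 R: 67.23 F1: 66.32 EX: 21.99
N=2: [Average]  P: 68.42 R: 70.7 F1: 69.54 EX: 24.15
N=3: [Average]  P: 71.09 R: 73.16 F1: 72.11 EX: 26.84
N=4: [Average]  P: 70.61 R: 73.19 F1: 71.88 EX: 27.1
N=5: [Average]  P: 69.79 R: 72.46 F1: 71.1 EX: 26.52'

best=$(printf '%s\n' "$results" | awk '
  {
    # Find the value that follows the "F1:" label on this line.
    for (i = 1; i < NF; i++)
      if ($i == "F1:") f1 = $(i + 1)
    # Keep the line label and F1 string of the best round so far.
    if (f1 + 0 > max) { max = f1 + 0; label = $1; shown = f1 }
  }
  END { sub(/:$/, "", label); print label, shown }')

echo "$best"
```

For the numbers above this prints "N=3 72.11".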
