CREATION OF GER_TIGER_25PC.GR

DATA

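Commands used to create tiger_release_july03_train.mrg (training data) and tiger_release_july03_test.mrg (test data). Lines 293586-368957 of tiger_release_july03.penn form the test set; all remaining lines form the training set: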

sed -n '1,293585 p' tiger_release_july03.penn > tiger_release_july03_train1.penn
sed -n '293586,368957 p' tiger_release_july03.penn > tiger_release_july03_test.penn
sed -n '368958,423279 p' tiger_release_july03.penn > tiger_release_july03_train2.penn
cat tiger_release_july03_train1.penn tiger_release_july03_train2.penn > tiger_release_july03_train.penn
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_train.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' > tiger_release_july03_train.mrg
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_test.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' > tiger_release_july03_test.mrg
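
The two pipelines above apply the same clean-up steps to the training and test files. As a rough illustration of what each step does, the sketch below wraps the same sed/grep chain in a function and runs it on a tiny made-up input. The function name normalize_tiger and the sample lines (including the "%% Sent 1" comment line) are assumptions for illustration only, not taken from the actual treebank file.

```shell
# Same sed/grep chain as above, reading stdin and writing stdout.
normalize_tiger() {
  # 1. (CAT-FUNC  ->  (CAT*FUNC   (function label separator "-" becomes "*")
  # 2. ($(        ->  ($LRB       (the "$(" tag would otherwise look like a bracket)
  # 3. a line consisting only of "(" becomes "(("
  # 4. a line consisting only of ")" becomes "))"
  # 5. drop empty lines
  # 6. drop lines starting with "%"
  sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' |
    sed 's/($(/($LRB/g' |
    sed 's/^($/((/g' |
    sed 's/^)$/))/g' |
    grep -v '^$' |
    grep -v '^%'
}

# Hypothetical sample input: a comment line, a tree spread over three lines,
# and a trailing empty line.
printf '%s\n' \
  '%% Sent 1' \
  '(' \
  '(S (NP-SB (ART-NK Die) (NN-NK Tagung)) ($( -))' \
  ')' \
  '' | normalize_tiger
# Prints:
# ((
# (S (NP*SB (ART*NK Die) (NN*NK Tagung)) ($LRB -))
# ))
```

With such a function, the two pipelines could be written as normalize_tiger < tiger_release_july03_train.penn > tiger_release_july03_train.mrg, and likewise for the test file.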


TRAINING

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path tiger_release_july03_train.mrg -out ger_tiger.gr -treebank SINGLEFILE


TESTING

Each of the intermediate files ger_tiger.gr_N_smoothing.gr (for N=1,...,6) was evaluated on the test set to check for overfitting.

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_N_smoothing.gr
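
Since the command above has to be repeated once per value of N, it can be wrapped in a loop. The following is only a sketch: the function name test_grammar and the DRY_RUN switch are assumptions added for illustration. With DRY_RUN set, the commands are printed rather than executed, so the loop can be checked without the jar or the grammar files being present.

```shell
# Run (or, with DRY_RUN set, just print) the tester command for one value of N.
test_grammar() {
  # ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is non-empty, else to nothing.
  ${DRY_RUN:+echo} java -Xmx15000m -cp BerkeleyParser.jar \
    edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE \
    -path tiger_release_july03_test.mrg -in "ger_tiger.gr_${1}_smoothing.gr"
}

# Dry run: print the six commands. Use DRY_RUN= (empty) to actually run them.
DRY_RUN=1
for N in 1 2 3 4 5 6; do
  test_grammar "$N"
done
```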

Results (second-to-last line of the testing output):

N=1: [Average]  P: 61.32 R: 63.39 F1: 62.34 EX: 17.82
N=2: [Average]  P: 64.92 R: 66.83 F1: 65.86 EX: 20.62
N=3: [Average]  P: 65.24 R: 67.13 F1: 66.17 EX: 21.36
N=4: [Average]  P: 65.15 R: 67.05 F1: 66.08 EX: 21.04
N=5: [Average]  P: 64.49 R: 66.38 F1: 65.42 EX: 20.93
N=6: [Average]  P: 64.58 R: 66.54 F1: 65.54 EX: 20.93

ger_tiger.gr_3_smoothing.gr, which has the highest F1 on the test set (66.17), was used as the final ger_tiger_25pc.gr.
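
The choice of N=3 is consistent with picking the row with the highest F1 in the results listed above. A minimal sketch of doing that selection mechanically is shown below. The function name best_by_f1 is an assumption for illustration, and the code assumes the result lines have exactly the "N=k: [Average]  P: ... R: ... F1: ... EX: ..." layout shown in this file.

```shell
# Read result lines on stdin and print the label and F1 of the best row.
best_by_f1() {
  awk '
    # Field 7 is the literal "F1:" and field 8 its value.
    $7 == "F1:" && (label == "" || $8 + 0 > best) {
      best = $8 + 0; label = $1; f1 = $8
    }
    END { sub(/:$/, "", label); print label, "F1=" f1 }
  '
}

# The six result lines copied from the Results list.
results='N=1: [Average]  P: 61.32 R: 63.39 F1: 62.34 EX: 17.82
N=2: [Average]  P: 64.92 R: 66.83 F1: 65.86 EX: 20.62
N=3: [Average]  P: 65.24 R: 67.13 F1: 66.17 EX: 21.36
N=4: [Average]  P: 65.15 R: 67.05 F1: 66.08 EX: 21.04
N=5: [Average]  P: 64.49 R: 66.38 F1: 65.42 EX: 20.93
N=6: [Average]  P: 64.58 R: 66.54 F1: 65.54 EX: 20.93'

printf '%s\n' "$results" | best_by_f1
# Prints: N=3 F1=66.17
```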
