CREATION OF GER_TIGER_10PC.GR

DATA

Commands used to create tiger_release_july03_train.mrg (training data) and tiger_release_july03_test.mrg (testing data):

sed -n '1,140556 p' tiger_release_july03.penn > tiger_release_july03_train.penn
sed -n '293586,368957 p' tiger_release_july03.penn > tiger_release_july03_test.penn
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_train.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' > tiger_release_july03_train.mrg
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_test.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' > tiger_release_july03_test.mrg
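To see what the conversion does, the same filter chain can be run on a tiny input. This is only a sketch: the function name penn_to_mrg and the sample fragment are made up for illustration, and the fragment's layout is an assumption about the export format, not text copied from the treebank.

```shell
# Same filters as in the commands above, wrapped in a function.
# The name penn_to_mrg is hypothetical.
penn_to_mrg() {
  sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' \
    | sed 's/($(/($LRB/g' \
    | sed 's/^($/((/g' \
    | sed 's/^)$/))/g' \
    | grep -v '^$' \
    | grep -v '^%'
}

# Made-up fragment (assumed layout): a comment line, a lone "(" and ")"
# around the tree, and a trailing blank line.
sample_out=$(penn_to_mrg <<'EOF'
%% hypothetical sentence
(
(S (NP-SB (ART-NK Die) (NN-NK Tagung)) (VVFIN-HD endet) ($( -))
)

EOF
)
printf '%s\n' "$sample_out"
```

On this input the hyphen between a category and its function label is replaced by "*" (NP-SB becomes NP*SB), the tag "$(" becomes "$LRB", the lone brackets are doubled, and the comment and blank lines are dropped.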


TRAINING

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path tiger_release_july03_train.mrg -out ger_tiger.gr -treebank SINGLEFILE


TESTING

Each of the intermediate files ger_tiger.gr_N_smoothing.gr (for N=1,...,6) was evaluated on the test set to check for overfitting.

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_N_smoothing.gr
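Since the same command is repeated for each N, it can be generated in a loop. The sketch below is a dry run that only prints the command lines; the function name tester_cmd is hypothetical, and the range 1 to 6 is taken from the results listed in this file.

```shell
# Print the tester command for one value of N.
# The function name tester_cmd is hypothetical.
tester_cmd() {
  printf 'java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_%s_smoothing.gr\n' "$1"
}

# Dry run: list the command for each intermediate grammar.
for N in 1 2 3 4 5 6; do
  tester_cmd "$N"
done
```

Piping the printed lines to sh would run the evaluations one after another, assuming BerkeleyParser.jar and the grammar files are in the current directory.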

Results (second-to-last line of the testing output):

N=1: [Average]  P: 58.72 R: 60.7 F1: 59.69 EX: 16.19
N=2: [Average]  P: 61.33 R: 62.94 F1: 62.13 EX: 17.87
N=3: [Average]  P: 61.56 R: 63.01 F1: 62.27 EX: 18.09
N=4: [Average]  P: 60.61 R: 61.85 F1: 61.22 EX: 17.77
N=5: [Average]  P: 60.39 R: 61.6 F1: 60.99 EX: 18.03
N=6: [Average]  P: 60.42 R: 61.39 F1: 60.9 EX: 17.51

ger_tiger.gr_3_smoothing.gr was used as the final ger_tiger_10pc.gr.
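
The choice of N=3 can be checked by ranking the result lines by F1. A minimal sketch, assuming the lines are whitespace-separated exactly as shown above, so that the F1 value is the eighth field; the variable names are illustrative.

```shell
# Result lines copied from the list above.
results='N=1: [Average]  P: 58.72 R: 60.7 F1: 59.69 EX: 16.19
N=2: [Average]  P: 61.33 R: 62.94 F1: 62.13 EX: 17.87
N=3: [Average]  P: 61.56 R: 63.01 F1: 62.27 EX: 18.09
N=4: [Average]  P: 60.61 R: 61.85 F1: 61.22 EX: 17.77
N=5: [Average]  P: 60.39 R: 61.6 F1: 60.99 EX: 18.03
N=6: [Average]  P: 60.42 R: 61.39 F1: 60.9 EX: 17.51'

# Remember the first field of the line with the largest F1 (field 8),
# then strip everything except the digits of N before printing it.
best_n=$(printf '%s\n' "$results" \
  | awk '$8 + 0 > max { max = $8 + 0; n = $1 } END { gsub(/[^0-9]/, "", n); print n }')
echo "best N by F1: $best_n"
```

F1 rises up to N=3 and falls afterwards, which is the overfitting pattern the test-set evaluation was meant to detect.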
