CREATION OF GER_TIGER_HALF.GR

DATA

Commands used to create tiger_release_july03_train.mrg (training data) and tiger_release_july03_test.mrg (testing data):

sed -n '1,293585 p' tiger_release_july03.penn > tiger_release_july03_train1.penn
sed -n '293586,368957 p' tiger_release_july03.penn > tiger_release_july03_test.penn
sed -n '368958,777570 p' tiger_release_july03.penn > tiger_release_july03_train2.penn
cat tiger_release_july03_train1.penn tiger_release_july03_train2.penn > tiger_release_july03_train.penn
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_train.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' > tiger_release_july03_train.mrg
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_test.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' > tiger_release_july03_test.mrg
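
The two conversion commands above run the same pipeline on different input files. As a rough sketch, the pipeline can be wrapped in a shell function and tried on a small sample to see what each stage does. The function name penn_to_mrg and the sample tree are hypothetical and are not taken from the treebank. The sed and grep expressions are copied unchanged from the commands above.

```shell
# penn_to_mrg: hypothetical wrapper around the pipeline used above.
# Reads .penn lines on stdin and writes .mrg lines on stdout.
penn_to_mrg() {
  sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' |  # (NP-SB -> (NP*SB
    sed 's/($(/($LRB/g' |                        # tag $( -> $LRB
    sed 's/^($/((/g' |                           # a line that is only "(" -> "(("
    sed 's/^)$/))/g' |                           # a line that is only ")" -> "))"
    grep -v '^$' |                               # drop empty lines
    grep -v '^%'                                 # drop lines starting with %
}

# Made-up sample input, for illustration only.
penn_to_mrg <<'EOF'
%% Sent 1
(
  (S (NP-SB (ART-NK Die) (NN-NK Tagung)) (VVFIN-HD beginnt) ($( -))
)

EOF
# prints:
# ((
#   (S (NP*SB (ART*NK Die) (NN*NK Tagung)) (VVFIN*HD beginnt) ($LRB -))
# ))
```

With this helper, the training file would be produced by redirecting tiger_release_july03_train.penn into penn_to_mrg and its output into tiger_release_july03_train.mrg, and likewise for the test file.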


TRAINING

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path tiger_release_july03_train.mrg -out ger_tiger.gr -treebank SINGLEFILE

Training was killed after 5 completed rounds because the grammar had started to overfit (see the testing results below).


TESTING

Each of the intermediate files ger_tiger.gr_N_smoothing.gr (for N=1,...,5) was evaluated on the test set to check for overfitting.

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_N_smoothing.gr
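
Running the command above once per intermediate grammar can be scripted. The following is a minimal sketch, not the procedure actually used. The function names second_last_line and evaluate_rounds are hypothetical, and the sketch assumes that the summary line of interest is the second-to-last line written to standard output.

```shell
# second_last_line: hypothetical helper that keeps only the
# second-to-last line of whatever arrives on stdin.
second_last_line() {
  tail -n 2 | head -n 1
}

# evaluate_rounds: hypothetical helper that runs GrammarTester on each
# intermediate grammar (N = 1..5) with the same options as the command
# above and prints one labelled summary line per grammar.
evaluate_rounds() {
  for N in 1 2 3 4 5; do
    printf 'N=%s: ' "$N"
    java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester \
      -treebank SINGLEFILE -path tiger_release_july03_test.mrg \
      -in "ger_tiger.gr_${N}_smoothing.gr" | second_last_line
  done
}
```

Calling evaluate_rounds in the directory that holds BerkeleyParser.jar, the test file, and the five intermediate grammars would run the five evaluations in sequence.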

Results (second-to-last line of the testing output for each N):

N=1: [Average]  P: 63.27 R: 65.4 F1: 64.32 EX: 19.51
N=2: [Average]  P: 67.26 R: 69.55 F1: 68.38 EX: 22.99
N=3: [Average]  P: 68.45 R: 70.53 F1: 69.47 EX: 23.41
N=4: [Average]  P: 68.65 R: 70.86 F1: 69.74 EX: 23.31
N=5: [Average]  P: 68.58 R: 70.84 F1: 69.69 EX: 23.78

ger_tiger.gr_4_smoothing.gr, which had the highest F1 (69.74) before the drop at N=5, was used as the final ger_tiger_half.gr.
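
As a cross-check, the round with the highest F1 can be picked out of the [Average] lines mechanically. The sketch below is not part of the original procedure. The function name best_round is hypothetical, and it assumes each input line has the same layout as the result lines listed above.

```shell
# best_round: hypothetical helper. Reads "N=k: [Average] ... F1: x ..."
# lines on stdin and prints the label and F1 of the line with the highest F1.
best_round() {
  awk '
    BEGIN { best = -1 }
    {
      for (i = 1; i < NF; i++) {
        if ($i == "F1:" && $(i + 1) + 0 > best) {
          best = $(i + 1) + 0   # numeric value used for comparison
          f1 = $(i + 1)         # original text, printed unchanged
          label = $1            # e.g. "N=4:"
        }
      }
    }
    END { sub(/:$/, "", label); print label, f1 }
  '
}

best_round <<'EOF'
N=1: [Average]  P: 63.27 R: 65.4 F1: 64.32 EX: 19.51
N=2: [Average]  P: 67.26 R: 69.55 F1: 68.38 EX: 22.99
N=3: [Average]  P: 68.45 R: 70.53 F1: 69.47 EX: 23.41
N=4: [Average]  P: 68.65 R: 70.86 F1: 69.74 EX: 23.31
N=5: [Average]  P: 68.58 R: 70.84 F1: 69.69 EX: 23.78
EOF
# prints: N=4 69.74
```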
