CREATION OF GER_TIGER_HALF_LC.GR

DATA

Commands used to create tiger_release_july03_train.mrg (training data) and tiger_release_july03_test.mrg (testing data):

iconv -f iso-8859-1 -t utf-8 ~/tiger/corpus/tiger_release_july03.penn > tiger_release_july03.penn
sed -n '1,293585 p' tiger_release_july03.penn > tiger_release_july03_train1.penn
sed -n '293586,368957 p' tiger_release_july03.penn > tiger_release_july03_test.penn
sed -n '368958,777570 p' tiger_release_july03.penn > tiger_release_july03_train2.penn
cat tiger_release_july03_train1.penn tiger_release_july03_train2.penn > tiger_release_july03_train.penn
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_train.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' | sed 's/./\L\0/g' > tiger_release_july03_train.mrg
sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_test.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' | sed 's/./\L\0/g' > tiger_release_july03_test.mrg
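The effect of the normalization pipeline is easiest to see on a small input. The sketch below wraps the same filter chain in a function (the name normalize_penn is hypothetical) and runs it on a made-up tree that is not taken from the TIGER corpus. It assumes GNU sed, since \L and \0 in the replacement are GNU extensions.

```shell
#!/bin/sh
# Same filter chain as in the commands above, reading stdin, writing stdout.
# Assumes GNU sed: \L (lowercase) and \0 (whole match) are GNU extensions.
normalize_penn() {
  sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' |  # (NP-SB  ->  (NP*SB
  sed 's/($(/($LRB/g' |                          # tag $(  ->  $LRB
  sed 's/^($/((/g' |                             # lone (  ->  ((
  sed 's/^)$/))/g' |                             # lone )  ->  ))
  grep -v '^$' |                                 # drop empty lines
  grep -v '^%' |                                 # drop % comment lines
  sed 's/./\L\0/g'                               # lowercase everything
}

# Made-up sample input (illustration only, not real corpus data).
sample() {
  printf '%s\n' \
    '%% Sent 1' \
    '(' \
    '(S (NP-SB (ART-NK Die) (NN-NK Tagung)) (VVFIN-HD beginnt) ($( -))' \
    ')' \
    ''
}

sample | normalize_penn
# Output:
# ((
# (s (np*sb (art*nk die) (nn*nk tagung)) (vvfin*hd beginnt) ($lrb -))
# ))
```

Note that the final step lowercases category labels as well as words, which is why the grammar name carries the "lc" suffix.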


TRAINING

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path tiger_release_july03_train.mrg -out ger_tiger.gr -treebank SINGLEFILE

Training was killed after 5 rounds had completed, because the grammar had started to overfit (see the testing results below).


TESTING

Each of the intermediate files ger_tiger.gr_N_smoothing.gr (for N=1,...,5) was evaluated on the test set to check for overfitting.

Command:
java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_N_smoothing.gr
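Evaluating all five intermediate grammars means repeating the command above with N replaced by 1 to 5. A minimal sketch of that loop follows. The helper name tester_cmd is hypothetical, and it only prints each command rather than running it, so it can be inspected without the parser jar present.

```shell
#!/bin/sh
# Print (do not run) the GrammarTester command for intermediate grammar N.
# All flags and file names are copied from the command above.
tester_cmd() {
  printf '%s\n' "java -Xmx15000m -cp BerkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE -path tiger_release_july03_test.mrg -in ger_tiger.gr_${1}_smoothing.gr"
}

# One command per intermediate grammar, N = 1..5.
for n in 1 2 3 4 5; do
  tester_cmd "$n"
done
```

To actually run one evaluation and keep only the summary line, the printed command can be piped through a shell, for example: tester_cmd 4 | sh | tail -n 2 | head -n 1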

Results (second-to-last line of the testing output):

N=1: [Average]  P: 62.71 R: 64.73 F1: 63.7 EX: 18.67
N=2: [Average]  P: 67.01 R: 69.05 F1: 68.02 EX: 22.73
N=3: [Average]  P: 67.67 R: 69.96 F1: 68.8 EX: 23.47
N=4: [Average]  P: 67.59 R: 70.23 F1: 68.88 EX: 24.1
N=5: [Average]  P: 66.77 R: 69.4 F1: 68.06 EX: 23.73

ger_tiger.gr_4_smoothing.gr (N=4) had the highest F1 (68.88) and was used as the final ger_tiger_half_lc.gr.
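
The choice of round can be checked mechanically from the result lines listed above. The sketch below (the function names best_round and results are hypothetical) scans each line for the value after "F1:" and reports the round with the highest score. The input is the five result lines copied from this file.

```shell
#!/bin/sh
# Read result lines on stdin and print the round label with the highest F1.
best_round() {
  awk 'BEGIN { best = -1 }
       {
         # Find the value that follows the "F1:" field on this line.
         for (i = 1; i < NF; i++)
           if ($i == "F1:") f = $(i + 1) + 0
         if (f > best) { best = f; label = $1 }
       }
       END { sub(/:$/, "", label); print label, best }'
}

# The five result lines from the testing runs, copied verbatim.
results() {
  printf '%s\n' \
    'N=1: [Average]  P: 62.71 R: 64.73 F1: 63.7 EX: 18.67' \
    'N=2: [Average]  P: 67.01 R: 69.05 F1: 68.02 EX: 22.73' \
    'N=3: [Average]  P: 67.67 R: 69.96 F1: 68.8 EX: 23.47' \
    'N=4: [Average]  P: 67.59 R: 70.23 F1: 68.88 EX: 24.1' \
    'N=5: [Average]  P: 66.77 R: 69.4 F1: 68.06 EX: 23.73'
}

results | best_round
# Output: N=4 68.88
```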
