CoordinateNPDataAndCode.2011/README
Shane Bergsma
March 15, 2011

In order to comply with the 10MB supplementary submission
requirements, the version of the data submitted with the ACL paper
does not include the N-gram counts.  The full set of data can be found
at: http://www.clsp.jhu.edu/~sbergsma/coordNP.ACL11.zip

Also, the documentation on the above page may be improved over time
based on user feedback; the version submitted with the ACL paper will
be static.

1) Introduction

This directory contains data and code used in the experiments for
"Using Large Monolingual and Bilingual Corpora to Improve Coordination
Disambiguation."  If you use this material in your work, please cite
as:

Shane Bergsma, David Yarowsky and Kenneth Church, Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation, Proc. ACL-HLT 2011, Portland, Oregon, June 2011.

Please e-mail sbergsma@jhu.edu if you have any questions about the
data or the scripts.

2) File Organization Overview

Do/ -- scripts for running the experiments
Originals/ -- original annotated data
README -- this file
Scripts/ -- scripts used in the experiments
Tools/ -- general purpose scripts used in learning

Data/ -- contains all the experimental data
 Data/Cotrain/FVs    -- stores generated feature vectors
 Data/Cotrain/       -- this directory and the following three store
 Data/Cotrain/Cache     temporary data: predictions, model weights,
 Data/Cotrain/ML        etc.
 Data/Cotrain/ML_CT

3) Main Data

a) Original labeled and unlabeled coordinate NP examples:

Originals/bitext.* --> Data used in Section 7
Originals/wsj.* --> Data used in Section 8

Example:

1 mental/JJ and/CC physical/JJ health/NN
0 pilots/NNS and/CC gate/NN agents/NNS

Format:

LABEL word/tag word/tag word/tag ...
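As a sketch, the format above can be read with a few lines of Python
(parse_example is a hypothetical helper for illustration, not part of
the released scripts):

```python
def parse_example(line):
    """Parse a 'LABEL word/tag word/tag ...' example line into the
    integer label and a list of (word, tag) pairs."""
    fields = line.split()
    label = int(fields[0])
    # rsplit so a word that itself contains '/' keeps its tag intact
    tokens = [tuple(tok.rsplit("/", 1)) for tok in fields[1:]]
    return label, tokens

label, tokens = parse_example("1 mental/JJ and/CC physical/JJ health/NN")
# label is 1; tokens[0] is ("mental", "JJ")
```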

b) Examples including monolingual and bilingual context:

Data/WSJExamples/wsj.*
Data/BitextExamples/bitext.*

Example:

(wsj):
1	mental/JJ and/CC physical/JJ health/NN	PrT=about/IN^FoT=./.:1 PrT=about/IN:1 FoT=./.:1 
0	pilots/NNS and/CC gate/NN agents/NNS	PrT=,/,^FoT=dressed/VBN:1 FoT=dressed/VBN:1 PrT=,/,:1 
(bitext):
1	iron/NN and/CC steel/NN plants/NNS	FoT=from/IN:8 PrT=by/IN^FoT=from/IN:8 PrT=by/IN:8 	SPN.fi=0:1 fi:1 SPN.nl=1:1 1H:1 12:1 de:1 Ho_2o_it:1 Ho_2o_es:1 H2^it:1 H-2-!es:1 H2^es:1 1-H-!da:1 SPN.sv=3:1 2^nl:1 H-2-!fr:1 12^sv:1 H2^fr:1 ^de:1 1o-_o_o_Ho_da:1 !fi:1 H2:3 Ho_2o_fr:1 2o_nl:1 SPN.fr=2:1 1-2-!sv:1 SPN.es=2:1 SPN.da=4:1 SPN.it=2:1 2-!nl:1 2:1 SPN.de=0:1 H-2-!it:1 ^fi:1 1H^da:1 1o-_o_2o_sv:1 !de:1 

Format:

Label   word/tag word/tag word/tag ...   Monolingual-Context   [Bilingual-Context]

Note that the bilingual context occurs only in the bitext files.
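Since the fields are tab-separated, a line can be split as follows (a
sketch; parse_context_example is a hypothetical helper, not part of
the released scripts):

```python
def parse_context_example(line):
    """Split a tab-separated example line into its label, tagged NP,
    and monolingual / (optional) bilingual context features."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    np_tokens = fields[1].split()
    mono = fields[2].split()
    # only the bitext files carry a fourth, bilingual-context field
    bili = fields[3].split() if len(fields) > 3 else []
    return label, np_tokens, mono, bili
```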

Explanation:

<Monolingual Context>

The monolingual context denotes the words that occur before (PrT=X,
i.e. "previous term equals X") and after (FoT=X, i.e. "following term
equals X") the example in the original English text.  The number after
the colon ':' is the count of how often that term occurred.  We also
encode how often a preceding term and a following term occurred
together, so PrT=by/IN^FoT=from/IN:8 means that this example ("iron
and steel plants") was both preceded by 'by' and followed by 'from' 8
times in the source corpus.
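A single context feature can be split into its term(s) and count like
so (a sketch; it assumes, as in the examples above, that the count
follows the last colon):

```python
def parse_context_feature(feat):
    """Split a feature such as 'PrT=by/IN^FoT=from/IN:8' into its
    component terms and corpus count; '^' joins a preceding term and
    a following term that occurred together."""
    name, count = feat.rsplit(":", 1)
    return name.split("^"), int(count)
```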

<Bilingual Context> 

This context denotes how the example was translated in the bitext.
The creation of these contextual features is described in detail in
the paper.  For example, SPN gives how many tokens the example
'spans', while "1o-_o_o_Ho_da" means that in Danish, the example "iron
and steel plants" was translated with the first word, "iron," coming
first, followed by a hyphen, followed by other words, followed by the
head, "plants" [the second word was not aligned in this particular
case].  Referring to Table 4 and the surrounding discussion in our
paper, the "ord" pattern is marked with a caret '^', the "simp"
pattern is marked with a '!', and the detailed pattern has no
punctuation (or just an underscore) before the language tag (here,
'_da').
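Going by the markers just described, a bilingual feature can be
roughly classified as follows (an approximation for illustration
only; the authoritative rules are in the paper and the released
scripts):

```python
def pattern_class(feat):
    """Roughly classify a bilingual feature by its marker: 'SPN.'
    prefixes a span feature, '^' marks the "ord" pattern, '!' the
    "simp" pattern, and anything else is a detailed pattern."""
    name = feat.rsplit(":", 1)[0]  # strip the trailing count
    if name.startswith("SPN."):
        return "span"
    if "^" in name:
        return "ord"
    if "!" in name:
        return "simp"
    return "detailed"
```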

4) Generating FVs and running the code

Note: To reproduce our experiments exactly, you'll need to download
and install LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/);
any other learning algorithm would probably work as well, with the
appropriate changes.  To run our scripts, you must edit
runTrainTestLR.pl to tell it where to find LibLinear (at the point
marked "#HERE#" in the script).

Once that is done, you can reproduce the FVs and our results by
running the following in order:

a) Non-co-training predictions (these runs also create the data used
by co-training):

Do/runMono1.sh
-- train a monolingual classifier on WSJ data and test on all data sets

Do/runBitext1.sh
-- train a monolingual classifier on bitext data and test on all data sets

Do/runBitext2.sh
-- train and test a bilingual-feature classifier on bitext data

Do/runBitextBoth.sh
-- train and test a classifier using mono+bili views on bitext data

b) Cotraining predictions (uses data created above):

Do/runBitextCoTrain.sh {2,10,100}
-- run the co-training starting from 2, 10, or 100 initial examples
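For readers unfamiliar with the overall procedure, a generic two-view
co-training loop looks roughly like the following (a toy sketch, not
the paper's exact algorithm; `train` is any hypothetical single-view
learner that returns a signed confidence scorer):

```python
def cotrain(labeled, unlabeled, train, rounds):
    """Generic two-view co-training: each round, train a scorer on
    each view and let it move its single most confident unlabeled
    example into the labeled pool with its predicted label."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                return labeled
            scorer = train([(x[view], y) for x, y in labeled])
            # the most confident example under this view's scorer
            best = max(pool, key=lambda x: abs(scorer(x[view])))
            pool.remove(best)
            labeled.append((best, 1 if scorer(best[view]) > 0 else 0))
    return labeled
```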

If you have any questions or comments, please contact Shane Bergsma at
sbergsma@jhu.edu.
