This package contains the code needed to replicate the main results of the EMNLP 2013 paper "A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability" by Elsner, Goldwater, Feldman and Wood. Please cite this paper when using the software.
An up-to-date version will be maintained at https://bitbucket.org/melsner/beamseg
You'll need the GNU Scientific Library and Boost. If you're not on a 64-bit machine, make the directories lib32 and bin32.
Then type make.
You'll get some warnings, but everything should build.
We've provided the transducer files used in this study.
For convenience, we have also included a copy of the modified-Brent dataset files provided by Elsner et al. at http://www.ling.ohio-state.edu/~melsner/resources/acl12data.tgz, along with their readme file.
To run the model with EM:
bin64/BeamSample --alpha 3000 --alpha 100 data/brentStest.surface --channel phone --read-channel data/initialChannel --write-channel output.channel --output output --em
This should replicate our main results in Table 1. Be aware that it will take substantial time and memory. The beam sampler makes the program faster, but it doesn't make it fast.
To run the baseline, use --channel none. To run the oracle, use --read-channel data/oracleChannel. To run unigrams, use:
bin64/BeamSample --grams 1 --alpha 20 data/brentStest.surface --channel phone --read-channel data/oracleChannel --output output.unigrams
Notice that:
- --grams sets the n-gram size (sizes above 2 are not supported!)
- --alpha sets the alpha parameters of the Dirichlet processes; for bigrams, pass --alpha twice, with the first value being A0 and the second A1
- --channel selects the channel type
- --read-channel selects a channel file (if needed)
- --write-channel writes the learned channel
- --em learns the channel
- --lock-bounds prevents the sampler from moving word boundaries
- --output selects the filename to write
- --brent-reader reads data with one character per phoneme, words delimited by spaces (as in the Goldwater distribution of Brent)
- --help prints a list of command-line options (not all of which are actually supported)
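As an illustration of how these flags combine, here is a small, hypothetical Python helper (not part of the package) that assembles a BeamSample command line using only the options documented above; paths and defaults are examples:

```python
# Hypothetical helper: build a BeamSample argv list from the documented flags.
# Defaults mirror the bigram EM run shown earlier; adjust paths as needed.
def beamsample_cmd(data, grams=2, alphas=(3000, 100), channel="phone",
                   read_channel=None, output="output", em=False):
    cmd = ["bin64/BeamSample"]
    if grams == 1:
        cmd += ["--grams", "1"]
    for a in alphas:  # for bigrams, first alpha is A0, second is A1
        cmd += ["--alpha", str(a)]
    cmd.append(data)
    cmd += ["--channel", channel]
    if read_channel:
        cmd += ["--read-channel", read_channel]
    cmd += ["--output", output]
    if em:
        cmd.append("--em")
    return cmd
```

You could pass the result to subprocess.run; nothing here is run automatically.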
Your runs above should generate files output.learned.surface and output.learned.underlying; the scorer tools take the output stem and search for files with names in this format.
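The naming convention the scorers rely on can be sketched in a couple of lines of Python (illustrative only; the actual lookup lives in the scoring scripts):

```python
# Given an output stem, the scorers expect these two learned files to exist.
def learned_files(stem):
    return [stem + ".learned.surface", stem + ".learned.underlying"]
```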
Run:
python script/scoreSeg.py data/brentStest output

Sample output:
1311 true words 1025 found mapped words 468 matched
[not useful: unmapped tok] UP 47.78 UR 48.49 UF 48.13
[maptok] MP 49.15 MR 49.88 MF 49.51
[surf] SP 66.85 SR 67.84 SF 67.34
[bds] BP 80.92 BR 82.63 BF 81.77
[not useful: unmapped lex] LP 38.05 LR 29.75 LF 33.39
[maplex] MLP 45.66 MLR 29.75 MLF 40.07
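In each triple, the F value is the harmonic mean of the corresponding precision (P) and recall (R); for example, UF follows from UP and UR. A minimal check, assuming values are percentages:

```python
# F-score as the harmonic mean of precision and recall (both in percent).
def f_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

print(round(f_score(47.78, 48.49), 2))  # → 48.13, matching UF above
```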
You can obtain the error analyses we use by running:
python script/reportRealTokens.py data/brentStest output
Viewing transducers:
bin64/PrintChannel data/oracleChannel
Making an oracle transducer:
bin64/OracleTransducer --underlying data/brentStest.underlying data/brentStest.surface --channel-file data/initialChannel > oracleChannel
Making an initial transducer:
bin64/InitialTransducer data/brentStest.surface > initialChannel
We've given you our test output in the test-runs directory.
We didn't make much of an effort to remove developmental dead ends, test options, and so forth from this package. Apart from the commands in this file, none of the other options for the various scripts or programs are guaranteed not to crash, to do what they say they do, or to do anything useful at all.
Code cleanups and more documentation will hopefully be forthcoming.