# MoNoise Readme #

MoNoise is a lexical normalization model for Twitter data (it can also be used for other domains). In short, its task is to convert:

new pix comming tomoroe

to:

new pictures coming tomorrow

This model achieves a recall of 84% on the LexNorm corpus using only the highest-ranking candidate (in roughly 5 seconds, plus about 20 seconds for loading the models). It achieves an F1 score of 82 on the lexnorm2015 corpus.

A short abstract of the model:

This model generates candidates using the Aspell spell
checker and a word embedding model trained on
Twitter data. Features from the generation are then
complemented with n-gram probabilities of canonical text
and the Twitter domain. A random forest classifier is
exploited for the ranking of the generated candidates.


### Requirements ###

* A recent C++ compiler (>= C++11)
* 8 GB RAM (20 GB for full training; smaller models can be trained with less)
* Input data

### Installation ###

* First download the necessary data/models using prep.sh:
```
#!bash
> ./scripts/prep.sh
```

* Adjust the config file if necessary (just make sure all the files are present):
```
#!bash
numTrain 2000
numDev 2000
w2v cached enData/w2v.bin
twitter cached enData/twitter.bin
wiki cached enData/wiki.bin
knowns raw enData/knowns
nes raw enData/nes.txt
dict raw enData/aspell
aspell en_US ./enData/extraDictEn normal
```
Do not change the order of the lines or the first word of each line. If you want to use your own n-grams, and enData/google.1 and enData/google.2 exist, your config should look like this:
```
#!bash
numTrain 2000
numDev 2000
w2v cached enData/w2v.bin
twitter cached enData/twitter.bin
wiki raw enData/google
knowns raw enData/knowns
nes raw enData/nes.txt
dict raw enData/aspell
aspell en_US ./enData/extraDictEn normal
```
Note that the first two numbers are the number of sentences used for training and development, respectively.
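Since every file named in the config has to be present, a quick check before compiling can save a failed run. The following is only a minimal sketch and not part of MoNoise: `check_config` is a hypothetical helper name, and it assumes that a `cached` or `raw` entry is satisfied when either the exact path or a file starting with that path (such as enData/google.1) exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MoNoise): report config entries whose
# data file appears to be missing.
check_config() {
    local config="$1" key mode path rest f found status=0
    while read -r key mode path rest || [ -n "$key" ]; do
        # Only "<name> cached <path>" and "<name> raw <path>" lines name data
        # files; numTrain, numDev and the aspell line are skipped.
        case "$mode" in
            cached|raw) ;;
            *) continue ;;
        esac
        # Assumption: the exact path or a suffixed file (e.g. enData/google.1)
        # counts as present.
        found=0
        for f in "$path"*; do
            if [ -e "$f" ]; then found=1; break; fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: $key -> $path"
            status=1
        fi
    done < "$config"
    return "$status"
}

# Usage: check_config config.en && echo "all files present"
```

The function returns a non-zero status as soon as at least one entry has no matching file, so it can be used as a guard in a build script.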

Compile (edit icmconf to set the right C++ version if necessary):
```
#!bash
> icmbuild
```
If icmbuild is not available and you do not want to install it (sudo apt-get install icmake), compile directly with g++:
```
#!bash
g++ --std=c++14 -Wall *cc */*cc -lpthread -lm headers/libaspell.so.15.1.5
```

### Run the system ###

Just run the binary to see the possible options:
```
> ./tmp/bin/binary
USAGE: ./monoise [options]

Options:
  -h         --help          Print usage and exit.

  -m <arg>   --mode=<arg>    Where arg = TRain, TEst, DEv, RUn, INteractive
                             (Required); DEv is equal to TEst, but only uses
                             part of the corpus (based on the config file).

  -r <arg>   --rf=<arg>      Path to the forest regressor. (Required, except
                             when using -u).

  -i <arg>   --input=<arg>   expects input in lexnorm (3 collumn) format: <word>
                             <spacefiller> <normalization>, when using TR, DE or
                             TE. For RU tokenized text is expected. Reads from
                             stdin if not used.

  -o <arg>   --output=<arg>  File to write to, when TEsting it writes the
                             results, and when RUnning it writes the
                             normalization. Writes to stdout if not specified.

  -f <arg>   --feats=<arg>   Specify the features to use. Should be the same as
                             the trained model!. Expects a boolean string,
                             default: 11111111. See model1.cc for possible
                             features.

  -c <arg>   --cands=<arg>   Specify the number of candidates outputted when
                             using RU.

  -k         --unk           Consider only unknown words for normalization. The
                             list of known words can be specified in the config
                             file. Note that this should probably also match
                             during training and testing/running.

  -a         --caps          Consider capitals. Most corpora don't use capitals
                             in gold data, so by default the input is converted
                             to lowercase, and evaluation is done with ignoring
                             capitals.

  -g         --gold          Assume gold error detection. Can not be used with
                             -m RUn, since it typically isnt available.

  -w         --weight        Extra weight given to original word.

  -n <arg>   --nThreads=<arg>Number of threads used in the classifier
                             (default=4).

  -l <arg>   --lookup=<arg>  Specify lookup file generated from another corpus.

  -t         --tokenize      Enable rule based tokenization.

  -d         --dutch         Dutch: Read configuration from config.nl instead of
                             config.en.

  -u         --uns           Unsupervised, rank using the forward backward
                             algorithm.

  -p <arg>   --parse=<arg>   Evaluate the parser. Argument should be the path to
                             the gold treebank.

  -s         --seed=<arg>    Seed used for random forest classifier (default=8).

  -v         --verbose       Print debugging info. NF = Not Found, NN = Not
                             Normalized, WR = Wrong Ranking.
```
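The 3-column input format described under -i can be inspected with standard tools. As an illustration only, the sketch below lists the tokens whose normalization differs from the original word. It assumes whitespace-separated columns with one token per line and uses `-` as the spacefiller; the function name `show_replacements` and the sample file are made up for this example and may not match the exact layout of your corpus.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<word> -> <normalization>" for every token
# whose third column differs from its first column.
show_replacements() {
    awk 'NF == 3 && $1 != $3 { print $1 " -> " $3 }' "$1"
}

# Sample file (hypothetical) for the sentence from the introduction.
sample=$(mktemp)
cat > "$sample" <<'EOF'
new - new
pix - pictures
comming - coming
tomoroe - tomorrow
EOF

show_replacements "$sample"
# pix -> pictures
# comming -> coming
# tomoroe -> tomorrow
rm -f "$sample"
```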

I've included some example usages in the scripts folder:

* ./scripts/lexnorm2015.sh: test performance on the lexnorm2015 dataset
* ./scripts/lexnorm.sh: test performance on lexnorm1.2
* ./scripts/parse.sh: test performance on Foster's Twitter constituency treebank using normalization
* ./scripts/testAll.sh: tests various settings


### Reference ###

An old version of this system is described in:

Normalizing social media texts by combining word embeddings and edit distances in a random forest regressor. (van der Goot, 2016)

http://www.let.rug.nl/rob/doc/normsome2016.pdf

http://www.let.rug.nl/rob/doc/normsome2016.bib

I am currently writing a more up-to-date description of the system; contact me for more information.


I made use of six other open source projects:

* word2vec: https://code.google.com/archive/p/word2vec/
* aspell: http://aspell.net/
* ranger: https://github.com/imbs-hl/ranger
* The Lean Mean C++ Option Parser: http://optionparser.sourceforge.net/
* evalb: http://nlp.cs.nyu.edu/evalb/
* Berkeley Parser: https://github.com/slavpetrov/berkeleyparser

### Contact ###

If you encounter any problems installing or running this software, or if you want to train the model using your own n-grams or word embeddings, don't hesitate to contact me:
r.van.der.goot@rug.nl

### Problems ###
The parser is written in Java; the normalization system communicates with it through sockets. Note that when running from a bash script, it is probably better to insert a "sleep 10" between two instantiations of the application, so that the port is no longer in use when the second one starts.
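The pause between two runs can be wrapped in a small function. This is a sketch under assumptions, not a script shipped with MoNoise: `run_with_pause` is a hypothetical name, the flags in the usage comment are taken from the usage text above, and all paths are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the given commands one after another, sleeping
# <pause> seconds between them so the socket port can be released.
run_with_pause() {
    local pause="$1" first=1 cmd
    shift
    for cmd in "$@"; do
        if [ "$first" -eq 0 ]; then
            sleep "$pause"
        fi
        first=0
        # Stop at the first failing command.
        bash -c "$cmd" || return 1
    done
}

# Usage (paths are placeholders):
# run_with_pause 10 \
#     "./tmp/bin/binary -m TE -r working/lexnorm -i data -p treebank1" \
#     "./tmp/bin/binary -m TE -r working/lexnorm -i data -p treebank2"
```

Ten seconds is the value suggested above; a longer pause may be needed if the port is released slowly on your machine.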

If it still does not work, use -m RUn in combination with -c <num> to generate an n-best normalization output. Then parse the result:
```
#!bash
./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 | java -jar util/BerkeleyGraph.jar -gr enData/ewtwsj.gr -latticeWeight 2
```
