Copyright 2011 Bo HAN. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:

   1. Redistributions of source code must retain the above copyright notice, this list of
      conditions and the following disclaimer.

   2. Redistributions in binary form must reproduce the above copyright notice, this list
      of conditions and the following disclaimer in the documentation and/or other materials
      provided with the distribution.

THIS SOFTWARE IS PROVIDED BY Bo HAN ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL Bo HAN OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the
authors and should not be interpreted as representing official policies, either expressed
or implied, of Bo HAN.



This folder contains descriptions of the data, code, and relevant tools used in "Lexical normalisation of short text messages: Makn sens a #twitter".
-Environment:
    -General:
      -Python 2.6.5
      -Ubuntu 10.04
      -NLTK toolkit [1] for tokenisation in langToolkit.py
      -SRILM [4] for Moses and for setting up a service for language model score calculation
    -For ill-formed word detection:
      -LIBLINEAR [2]
    -For phrase-based machine translation:
      -Moses [3]
      -SRILM [4]
      -GIZA++ [5]

-Quick start
    -Make sure the environment requirements above are satisfied.
    -Train a language model over clean English tweet data with the -unk option, and set up a server with "ngram -lm languageModelFile -unk -server-port 2345"; the corresponding client call in lm.py is "ngram -ppl - -use-server 2345@YourServerName -cache-served-ngrams -debug 1".
    -Generation of dependency features:
      -Run the Stanford parser [9] to obtain dependencies from the New York Times data/blog corpus described in the paper.
      -Run genSupp.py; it will generate contextSupport.pickle and contextDect.pickle. Put them in the data/ folder.
    -Normalisation (except SMT):
        Change into the code/ folder, and run smsRun.sh/tweetRun.sh accordingly.
        Note: (replace * with the concrete name)
        To get the performance, use "tail -1 *.error".
        To get the BLEU score [6], run "./getBLEU.sh *.test" for all methods except SMT.
        The multi-bleu.perl script is from Moses.
    -Normalisation SMT:
        Modify the environment settings in code/smtRun.sh (refer to the comments in the script).
        To get the BLEU score of SMT, run "./multi-bleu.perl *.norm < *.predict".
    -Ill-formed word detection:
        Put the dependency files in the data/ folder.
        Set up dect/ and linear/ folders, and put the compiled LIBLINEAR tool in the linear/ folder.
        Run experiment.py.
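The BLEU score [6] used for evaluation above is based on modified n-gram precision combined with a brevity penalty. A minimal single-sentence, single-reference Python sketch of the metric follows; it is illustrative only (the add-one smoothing is an assumption of this sketch), not the multi-bleu.perl implementation used by the scripts.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU sketch: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Returns 0.0 when the candidate has no n-grams of some order."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "modified" precision: clip each n-gram count by its reference count
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0:
            return 0.0
        # add-one smoothing avoids log(0) for short sentences (an assumption)
        precisions.append((overlap + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, a candidate identical to its reference scores 1.0, while a correct but truncated candidate is discounted by the brevity penalty alone.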

-Code
    framework.py: the experimental framework for comparing different methods
    libSoundexImp.py: revised Double Metaphone to obtain phonetic codes
        Note: Double Metaphone is revised for research purposes [7].
    libEditDist.py: to obtain morphological variations
        Note: the edit distance is based on Levenshtein distance [8].
    langToolkit.py: to get OOV words; a simple lexical tokeniser for tweets
        Note: the tokeniser is revised from the NLTK tokeniser.
    lm.py: language model client; it connects to the server to obtain n-gram language model scores
    experiment.py: conduct ill-formed word detection experiments under different parameters
    detectionExpDep.py: extract features for ill-formed word detection
    evalDetection.py: evaluate ill-formed word detection
    removeConfliction.py: remove conflicting samples
    significance.py: run the randomised significance test
    confusionRecall.py: calculate recall and average confusion candidate numbers
    getOOVDist.py: calculate the OOV word distribution
    textOOVRatio.py: calculate the OOV ratio of the local context range
    genSupp.py: convert dependency features into different structural context dictionaries for ill-formed word detection and normalisation, respectively
    oov.r and dect.r: generate figures in R
    sms.sh and tweet.sh: bash scripts for running experiments
    getBLEU.sh: used to calculate the BLEU score
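As background for the edit-distance note above: Levenshtein distance [8] is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming sketch (illustrative, not libEditDist.py itself):

```python
def levenshtein(s, t):
    """Levenshtein distance between strings s and t, computed row by row
    so only two rows of the DP table are kept in memory."""
    # prev[j] = distance between s[:i-1] and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance between s[:i] and the empty prefix of t
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion from s
                            curr[j - 1] + 1,      # insertion into s
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```

For instance, "makn" is distance 2 from "making" (insert "i", insert "g"), which is the kind of small morphological variation the normalisation candidates exploit.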

-Data
    dict.*: the dictionary used in experiments, revised from Aspell (.dict is plain text; .pickle is the Python pickle format)
    slang.*: a collection of Internet slang
    soundex.dict: a <word, sound representation> dictionary
    unigram.*: unigram dictionary used in the baselines
    contextSupport.pickle: for tweet dependency feature generation (excluded due to data size; it can be generated by genSupp.py)
    contextDect.pickle: for ill-formed word detection (training, testing) (excluded due to data size; it can be generated by genSupp.py)
    largeEngTweets: large English tweet data (excluded due to data size)
    svmtrainset.pickle: used in the training data generation process with largeEngTweets
    corpus.sms1: SMS corpus generated by formatSMSCorpus.py (excluded)
    corpus.tweet1: sampled tweets without OOV flags for experiments
    artificial.train: SMT training data (excluded due to data size)
    artificial.tuning: SMT tuning data (excluded due to data size)
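The *.pickle files above are serialised Python objects and can be loaded with the standard pickle module. A small round-trip sketch (the file name and <word, sound representation> entries here are hypothetical, for illustration only):

```python
import os
import pickle
import tempfile

# Hypothetical <word, sound representation> entries, mimicking the kind of
# mapping stored in the data files.
sample = {"tomorrow": "TMR", "see": "S"}

# Write the mapping out as a pickle file...
path = os.path.join(tempfile.gettempdir(), "soundex_sample.pickle")
with open(path, "wb") as f:
    pickle.dump(sample, f)

# ...and load it back, exactly as the experiment scripts load data/*.pickle.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```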


Reference:
[1] Bird, Steven; Ewan Klein; Edward Loper; Jason Baldridge (2008). Multidisciplinary instruction with the Natural Language Toolkit. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, ACL. http://aclweb.org/anthology-new/W/W08/W08-0208.pdf.
[2] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871-1874. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear
[3] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. (2007) "Moses: Open Source Toolkit for Statistical Machine Translation". Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
[4] A. Stolcke. SRILM -- an extensible language modeling toolkit. In Proceedings of International Conference on Spoken Language Processing, pages 901--904, 2002.
[5] Franz Josef Och, Hermann Ney. "A Systematic Comparison of Various Statistical Alignment Models", Computational Linguistics, volume 29, number 1, pp. 19-51 March 2003. 
[6] Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation" in ACL-2002: 40th Annual meeting of the Association for Computational Linguistics pp. 311–318
[7] Lawrence Philips. "The Double Metaphone Search Algorithm". C/C++ Users Journal, June 2000.
[8] Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.
[9] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.
