--------------------------------------------------------------------------------------
MATLAB code to demonstrate the fixed-size ordinally-forgetting encoding (FOFE) method
on two text corpora, PTB and Wiki.

      by Shiliang Zhang and Mingbin Xu  on Apr 20, 2015
	  
Citation: if you find this code useful, please cite the following paper:

[1] Shiliang Zhang, Hui Jiang, Mingbin Xu, Junfeng Hou, Lirong Dai, "A Fixed-Size Encoding Method
for Variable-Length Sequences with its Application to Neural Network Language Models," arXiv:1505.01504.
-------------------------------------------------------------------------------------------
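For reference, FOFE folds a variable-length word sequence into a single fixed-size vector by the recursion z_t = alpha * z_{t-1} + e_t, where e_t is the one-hot vector of the t-th word and alpha is the forgetting factor. The following is a minimal Python sketch of that recursion (the function name, toy vocabulary, and alpha value are illustrative assumptions, not taken from the MATLAB demos):

```python
def fofe_encode(sequence, vocab_size, alpha=0.7):
    """Fixed-size ordinally-forgetting encoding (FOFE) sketch.

    Encodes a sequence of word ids as a single vector of length
    vocab_size via z_t = alpha * z_{t-1} + e_t, where e_t is the
    one-hot vector of the word at position t.
    """
    z = [0.0] * vocab_size
    for word_id in sequence:
        z = [alpha * v for v in z]   # decay older context by alpha
        z[word_id] += 1.0            # add one-hot of the current word
    return z

# Example: toy vocabulary {0:'A', 1:'B', 2:'C'}, sequence "A B C"
code = fofe_encode([0, 1, 2], vocab_size=3, alpha=0.5)
# code == [0.25, 0.5, 1.0]: earlier words decay by powers of alpha,
# so the encoding is unique for a given sequence (for suitable alpha).
```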

Code Directory:

  1. RL_DNN_LM: run RL_DNN_LM/RL_DNN_LM_demo.m to train the baseline FNN-LMs on the PTB data set
  2. FOFE_RL_DNN_LM: run FOFE_RL_DNN_LM/FOFE_RL_DNN_LM_demo.m to train the FOFE-based FNN-LMs on the PTB data set
  3. RL_DNN_LM_no_momentum: run RL_DNN_LM_no_momentum/RL_DNN_LM_demo.m to train the baseline FNN-LMs on the wiki (LTCB) data set
  4. FOFE_RL_DNN_LM_no_momentum: run FOFE_RL_DNN_LM_no_momentum/FOFE_RL_DNN_LM_demo.m to train the FOFE-based FNN-LMs on the wiki (LTCB) data set

PTB:

  the Penn Treebank data sets, with all words normalized to numeric ids

wikiDataProcessing:

  0. download enwik9 from http://mattmahoney.net/dc/textdata.html
  
  1. run wiki_norm_iflytekInner.pl to filter the Wikipedia XML dump into "clean" text.

  2. run splitData.pl to split the data into training/validation/test sets.

  3. run countDict.sh to get the full vocabulary list file <dict_train_enWiki.txt>.

  4. limit the vocabulary to 80k words (dict_train_enWiki.80000.txt)

  5. run formatVocab.pl to get the final training/validation/test sets.

  6. run wiki.py to train the skip-gram model and convert the text files into CSV format.