# Run Field Embedding on Sample Data from Chinese Wikipedia

## 1. Install the Required Dependencies
The dependencies include `smart_open`, `jieba`, `gensim`, and `Cython`, among others.
An Anaconda environment is recommended.
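
Before continuing, it can help to confirm that the packages named above are importable. The following is a minimal sketch, not part of the project; the helper name `missing_modules` is hypothetical, and the list covers only the four packages mentioned here, not the full set of dependencies.

```python
import importlib.util

# Import names of the packages listed above (assumption: the project may need more).
REQUIRED = ["smart_open", "jieba", "gensim", "Cython"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

Any package reported as missing can then be installed with `pip` or `conda` before building the model.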

## 2. Prepare Dataset


First move the data folder `corpus` into this folder, i.e., `fieldembed_sample_code`.
```shell
# Build the corpus and its field information, and prepare the datasets
# (the Wikipedia sample and the NER datasets).
python run_nlptext.py
```



## 3. Build the Model with Cython

Build the `fieldembed` extension in place with the following commands.
```shell
cd fieldembed_sample_code
python setup.py build_ext --inplace
```

## 4. Train the Field Embeddings
The `-f` option sets the number of fields to train with:
```shell
python train.py -f 1  # train embeddings with 1 field: 'word'
python train.py -f 2  # train embeddings with 2 fields: 'word', 'subcomp'
python train.py -f 3  # train embeddings with 3 fields: 'word', 'subcomp', 'pinyin'
python train.py -f 4  # train embeddings with 4 fields: 'word', 'subcomp', 'pinyin', 'pos'
```
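
To train all four configurations in one go, the commands above can be wrapped in a small driver script. This is a sketch under stated assumptions: the helper names `build_command`, `fields_for`, and `run_all` are hypothetical, and the script assumes it is run from the folder that contains `train.py`. By default it only prints the commands.

```python
import subprocess
import sys

# Field order as listed in the training commands above.
FIELDS = ["word", "subcomp", "pinyin", "pos"]


def fields_for(num_fields):
    """Return the fields used when training with `-f num_fields`."""
    return FIELDS[:num_fields]


def build_command(num_fields):
    """Build the train.py command line for the given number of fields."""
    if not 1 <= num_fields <= len(FIELDS):
        raise ValueError(f"num_fields must be between 1 and {len(FIELDS)}")
    return [sys.executable, "train.py", "-f", str(num_fields)]


def run_all(dry_run=True):
    """Print each training command, and execute it unless dry_run is set."""
    for n in range(1, len(FIELDS) + 1):
        cmd = build_command(n)
        print(f"{n} field(s) {fields_for(n)}: {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the loop if one training run fails.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` runs the four training jobs sequentially.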

## 5. Lexical Evaluation


English Wikipedia field embeddings are provided as an example.
They are stored in the `embeddings` folder.
```shell
python lexical_eval.py
```

## 6. NER Tasks

In `script_train/data_config`, set the input linguistic fields.

```python
# English
input_fields = ['token', 'char', 'phoneme']  # select fields, in order, from ['token', 'char', 'phoneme', 'pos_en']

# pretrain_embeddings = 'WikiChinese/char'; SIZE = 200 
pretrain_embeddings = 'WikiEnglish/word'; SIZE = 200 

############## Open Domain Corpus
# Data_Dir = 'data/boson/char/'; min_token_freq = 1 # don't touch this
Data_Dir = 'data/CoNLL-2003/word/'; min_token_freq = 3
```
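
The comment on `input_fields` asks for fields to be selected in order. A quick sanity check for a candidate list might look like the sketch below. It assumes that "in order" means the chosen fields keep the relative order of the available list, with no duplicates; the project may impose a stricter rule (for example, a contiguous prefix). The helper name `is_ordered_selection` is hypothetical.

```python
# Available English fields, as listed in the config comment above.
AVAILABLE_FIELDS = ['token', 'char', 'phoneme', 'pos_en']


def is_ordered_selection(selected, available=AVAILABLE_FIELDS):
    """Return True if `selected` keeps the order of `available` with no unknown or repeated fields."""
    positions = [available.index(field) for field in selected if field in available]
    all_known = len(positions) == len(selected)
    # Strictly increasing positions rule out both reordering and duplicates.
    strictly_increasing = all(a < b for a, b in zip(positions, positions[1:]))
    return all_known and strictly_increasing


print(is_ordered_selection(['token', 'char', 'phoneme']))
print(is_ordered_selection(['char', 'token']))
```

The first call reports a valid selection, while the second is rejected because `'char'` precedes `'token'`.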

### Model Structures


To use the `Embed-CRF` structure:

```python
SeqRepr_Config_Name = 'EmbedOnly'  # options: 'BaseStruct', 'EmbedOnly'
```

To use the `Embed-BiLSTM-CRF` structure:

```python
SeqRepr_Config_Name = 'BaseStruct'  # options: 'BaseStruct', 'EmbedOnly'
```

### Training Parameters

Modify the training hyperparameters in `script_train/train_config.py`.


### Train the Model

Run NER tasks with:
```shell
python train_ner.py
```

