This project is built on Fairseq.
The example script below trains an MoE Transformer language model with 16 experts, launched as 16 processes on one node. Replace the placeholders before running: `{savedir}` is the directory where checkpoints and logs are saved, `{jobname}` is the name of the run, and `{DATADIR}` is the directory containing the Fairseq-preprocessed training data.

```bash
mkdir -p {savedir}/checkpoints/{jobname}
python -m torch.distributed.launch \
    --nproc_per_node=16 \
    train.py {DATADIR} \
    --task language_modeling \
    --save-dir {savedir}/checkpoints/{jobname} \
    --arch transformer_lm_BaseGPT_x1_medium \
    --moe-type base_layer \
    --two-stage-updates 6000 \
    --distill-assignment \
    --distilled-model wordemb \
    --distill-factor 0.3 \
    --criterion xentropy_aux \
    --balance-loss balance \
    --balance-factor 0.3 \
    --capacity-factor 2 \
    --assignment-algorithm GA \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.1 \
    --lr 0.0003 \
    --lr-scheduler polynomial_decay \
    --total-num-update 60000 \
    --warmup-updates 2000 \
    --tokens-per-sample 1024 \
    --sample-break-mode none \
    --batch-size 1 \
    --pad-to-fixed-length \
    --pad-to-fixed-bsz \
    --update-freq 8 \
    --max-update 60000 \
    --ddp-backend=legacy_ddp \
    --log-interval 100 \
    --log-file {savedir}/checkpoints/{jobname}/log.txt \
    --log-format tqdm \
    --validate-interval-updates 500 \
    --save-interval 5 \
    --tensorboard-logdir {savedir}/tblogs/{jobname} \
    --distributed-no-spawn \
    --fp16-no-flatten-grads \
    --fp16
```
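
As a rough sanity check on the batch settings in the command above, the effective batch size per optimizer update can be worked out from `--nproc_per_node`, `--batch-size`, `--update-freq`, and `--tokens-per-sample`. The sketch below is only arithmetic on those flag values. It assumes every sequence is exactly `--tokens-per-sample` tokens long (which `--sample-break-mode none` with `--pad-to-fixed-length` should approximately give) and that all 16 processes contribute to each update; the variable names are illustrative and not part of the project.

```bash
#!/usr/bin/env bash
# Flag values copied from the training command above.
num_gpus=16             # --nproc_per_node
batch_size=1            # --batch-size (sequences per process per step)
update_freq=8           # --update-freq (gradient accumulation steps)
tokens_per_sample=1024  # --tokens-per-sample
max_update=60000        # --max-update

# Assumption: each sequence holds exactly tokens_per_sample tokens.
sequences_per_update=$(( num_gpus * batch_size * update_freq ))
tokens_per_update=$(( sequences_per_update * tokens_per_sample ))
total_tokens=$(( tokens_per_update * max_update ))

echo "sequences per update: ${sequences_per_update}"
echo "tokens per update:    ${tokens_per_update}"
echo "tokens over the run:  ${total_tokens}"
```

Under these assumptions, running with fewer processes would need a proportionally larger `--update-freq` to keep the same number of tokens per update.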

