## Pre-training

The pre-training code is based on [fairseq-0.10.2](https://github.com/pytorch/fairseq). We provide the commands for reproducing our pre-trained models using 32 Nvidia Tesla V100 32GB GPUs.

### Distillation

Before running the command below, you need `model.pt.layer6` in the folder of the publicly released XLM-R base model. This model is a 6-layer version of the XLM-R base model and can be obtained by taking one layer out of two from the XLM-R base model, following [Sanh et al. (2020)](https://arxiv.org/pdf/1910.01108.pdf).
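The "one layer out of two" selection can be sketched on a plain state dict as follows. This is a minimal illustration, not the script used for the released model: the helper name `take_every_other_layer`, the key pattern `...layers.<index>...`, and the choice to keep the even-numbered layers (0, 2, 4, ...) are all assumptions.

```python
import re

# Matches the layer index in keys such as
# "encoder.sentence_encoder.layers.3.fc1.weight".
# The key prefix is an assumption; only ".layers.<index>." is relied on.
LAYER_KEY = re.compile(r"(^|\.)layers\.(\d+)\.")


def take_every_other_layer(state_dict, keep_offset=0, stride=2):
    """Keep layers keep_offset, keep_offset + stride, ... and renumber them from 0.

    Keys that do not belong to a numbered layer (embeddings, LM head, ...)
    are copied unchanged.
    """
    pruned = {}
    for key, value in state_dict.items():
        match = LAYER_KEY.search(key)
        if match is None:
            pruned[key] = value
            continue
        old_index = int(match.group(2))
        if old_index < keep_offset or (old_index - keep_offset) % stride != 0:
            continue  # this layer is dropped
        new_index = (old_index - keep_offset) // stride
        new_key = key[: match.start(2)] + str(new_index) + key[match.end(2):]
        pruned[new_key] = value
    return pruned


# Toy stand-in for a 12-layer checkpoint; real values would be tensors.
toy = {"embed_tokens.weight": "E"}
for i in range(12):
    toy[f"encoder.sentence_encoder.layers.{i}.fc1.weight"] = f"W{i}"

small = take_every_other_layer(toy)
print(sorted(small))
```

With a real checkpoint, one would presumably load `xlmr.base/model.pt` with `torch.load`, apply the function to the stored model weights, and save the result as `model.pt.layer6`. Whether the released model keeps the even or the odd layers is not stated here, so adjust `keep_offset` accordingly.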

    python train.py $YOUR_BINARIZED_DATA_DIR --task multilingual_masked_lm --criterion sparse_masked_lm --sample-break-mode complete --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 --total-num-update 300000 --batch-size 4 --update-freq 16 --max-update 300000 --weight-decay 0.01 --arch roberta_base --dropout 0.1 --attention-dropout 0.1 --log-format simple --log-interval 100 --no-epoch-checkpoints --dataset-impl mmap --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --seed 0 --multilang-sampling-alpha 0.7 --monolingual-langs 'af,am,ar,as,az,be,bg,bn,bn_rom,br,bs,ca,cs,cy,da,de,el,en,eo,es,et,eu,fa,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hi_rom,hr,hu,hy,id,is,it,ja,jv,ka,kk,km,kn,ko,ku,ky,la,lo,lt,lv,mg,mk,ml,mn,mr,ms,my_zaw,my,ne,nl,no,om,or,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,so,sq,sr,su,sv,sw,ta,ta_rom,te,te_rom,th,tl,tr,ug,uk,ur,ur_rom,uz,vi,xh,yi,zh-Hans,zh-Hant' --restore-file xlmr.base/model.pt.layer6 --save-dir xlmr_base.layer6.kd1.fp16 --tensorboard-logdir log --kd-weight 1 --cos-weight 1 --teacher-file xlmr.base/model.pt --encoder-layers 6 --fp16
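To make the learning-rate flags in the command above concrete (`--lr 0.0002 --warmup-updates 10000 --total-num-update 300000` with `--lr-scheduler polynomial_decay`), here is a small sketch of the schedule. It assumes linear warmup followed by polynomial decay with power 1 down to a final learning rate of 0, which we believe are the fairseq defaults when no other scheduler flags are given; the function name and the `end_lr` and `power` parameters are illustrative.

```python
def polynomial_decay_lr(step, peak_lr=0.0002, warmup_updates=10000,
                        total_num_update=300000, end_lr=0.0, power=1.0):
    """Learning rate at a given update: linear warmup, then polynomial decay."""
    if warmup_updates > 0 and step <= warmup_updates:
        return peak_lr * step / warmup_updates
    if step >= total_num_update:
        return end_lr
    remaining = 1.0 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


# Print the learning rate at a few points of the 300k-update run.
for step in (0, 5000, 10000, 155000, 300000):
    print(step, polynomial_decay_lr(step))
```

Under these assumptions the learning rate peaks at update 10000 and is back to half of its peak value halfway through the decay phase (update 155000).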

### Grad

Before running the commands below, run the script `scripts/induce_sparse_model_from_dense.py` to obtain `importance.pt`.

#### Shared

    python train.py $YOUR_BINARIZED_DATA_DIR --task multilingual_masked_lm --criterion sparse_masked_lm --sample-break-mode complete --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 --total-num-update 300000 --batch-size 16 --update-freq 4 --max-update 300000 --weight-decay 0.01 --arch sparse_xlmr_base --dropout 0.1 --attention-dropout 0.1 --log-format simple --log-interval 100 --no-epoch-checkpoints --dataset-impl mmap --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --seed 0 --multilang-sampling-alpha 0.7 --monolingual-langs 'af,am,ar,as,az,be,bg,bn,bn_rom,br,bs,ca,cs,cy,da,de,el,en,eo,es,et,eu,fa,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hi_rom,hr,hu,hy,id,is,it,ja,jv,ka,kk,km,kn,ko,ku,ky,la,lo,lt,lv,mg,mk,ml,mn,mr,ms,my_zaw,my,ne,nl,no,om,or,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,so,sq,sr,su,sv,sw,ta,ta_rom,te,te_rom,th,tl,tr,ug,uk,ur,ur_rom,uz,vi,xh,yi,zh-Hans,zh-Hant' --restore-file xlmr.base/model.pt --save-dir sparse_xlmr_base.one4all0.1,1.0.share.clamp.fp16 --tensorboard-logdir log --embed-factorize --sparse-impl hard_concrete --init-args "{'score_file':'xlmr.base/importance.pt','step':0.01}" --clamp --lang-agnostic --one4all '[0.1,1.0]' --fp16

**Note**: the above commands already apply Dynamic Sparsification. To pre-train a model with a fixed sparsity, say 50%, set `--one4all '[0.5]'` and replace `'step':0.01` in `--init-args` with `'sparsity':0.5`.
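The switch described in the note touches only two arguments. The sketch below builds both variants; `grad_sparsity_flags` is a hypothetical helper for illustration and is not part of the repository, and it only uses flag values that appear in the commands of this section.

```python
def grad_sparsity_flags(fixed_sparsity=None, score_file="xlmr.base/importance.pt"):
    """Return the sparsity-related arguments for the gradient-based commands.

    fixed_sparsity=None gives the dynamic setting ('step' plus a sparsity range);
    a float such as 0.5 gives the fixed-sparsity variant from the note.
    """
    if fixed_sparsity is None:
        init_args = {"score_file": score_file, "step": 0.01}
        one4all = "[0.1,1.0]"
    else:
        init_args = {"score_file": score_file, "sparsity": fixed_sparsity}
        one4all = f"[{fixed_sparsity}]"
    # repr() of a dict matches the quoting used in the commands once spaces are removed.
    return ["--init-args", repr(init_args).replace(" ", ""), "--one4all", one4all]


print(grad_sparsity_flags())      # dynamic sparsification
print(grad_sparsity_flags(0.5))   # fixed 50% sparsity
```

The returned list can be appended to an argument list passed to `subprocess.run`, which avoids shell quoting issues with the nested quotes.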

#### Non-shared

    python train.py $YOUR_BINARIZED_DATA_DIR --task multilingual_masked_lm --criterion sparse_masked_lm --sample-break-mode complete --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 --total-num-update 300000 --batch-size 16 --update-freq 8 --max-update 300000 --weight-decay 0.01 --arch sparse_xlmr_base --dropout 0.1 --attention-dropout 0.1 --log-format simple --log-interval 100 --no-epoch-checkpoints --dataset-impl mmap --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --seed 0 --multilang-sampling-alpha 0.7 --monolingual-langs 'af,am,ar,as,az,be,bg,bn,bn_rom,br,bs,ca,cs,cy,da,de,el,en,eo,es,et,eu,fa,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hi_rom,hr,hu,hy,id,is,it,ja,jv,ka,kk,km,kn,ko,ku,ky,la,lo,lt,lv,mg,mk,ml,mn,mr,ms,my_zaw,my,ne,nl,no,om,or,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,so,sq,sr,su,sv,sw,ta,ta_rom,te,te_rom,th,tl,tr,ug,uk,ur,ur_rom,uz,vi,xh,yi,zh-Hans,zh-Hant' --restore-file xlmr.base/model.pt --save-dir sparse_xlmr_base.one4all0.5.clamp.fp16 --tensorboard-logdir log --embed-factorize --sparse-impl hard_concrete --init-args "{'score_file':'xlmr.base/importance.pt','sparsity':0.5}" --clamp --one4all '[0.5]' --fp16
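When reproducing on different hardware, it helps to know the effective batch size behind `--batch-size` and `--update-freq`. The sketch below assumes that all 32 GPUs act as data-parallel workers and that `--batch-size` is the per-GPU number of sequences, so one optimizer step sees GPUs × batch size × update frequency sequences. The function name is illustrative.

```python
def effective_batch_size(num_gpus, batch_size, update_freq):
    """Sequences contributing to one optimizer step under data parallelism."""
    return num_gpus * batch_size * update_freq


NUM_GPUS = 32  # from the hardware stated at the top of this section

distillation = effective_batch_size(NUM_GPUS, batch_size=4, update_freq=16)
grad_shared = effective_batch_size(NUM_GPUS, batch_size=16, update_freq=4)
grad_non_shared = effective_batch_size(NUM_GPUS, batch_size=16, update_freq=8)

print(distillation, grad_shared, grad_non_shared)  # 2048 2048 4096
```

With fewer GPUs, scaling `--update-freq` up by the same factor should keep the effective batch size unchanged, at the cost of longer wall-clock time.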

### L0

#### Shared

    python train.py $YOUR_BINARIZED_DATA_DIR --task multilingual_masked_lm --criterion sparse_masked_lm --sample-break-mode complete --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 --total-num-update 300000 --batch-size 16 --update-freq 4 --max-update 300000 --weight-decay 0.01 --arch sparse_xlmr_base --dropout 0.1 --attention-dropout 0.1 --log-format simple --log-interval 100 --no-epoch-checkpoints --dataset-impl mmap --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --seed 0 --multilang-sampling-alpha 0.7 --monolingual-langs 'af,am,ar,as,az,be,bg,bn,bn_rom,br,bs,ca,cs,cy,da,de,el,en,eo,es,et,eu,fa,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hi_rom,hr,hu,hy,id,is,it,ja,jv,ka,kk,km,kn,ko,ku,ky,la,lo,lt,lv,mg,mk,ml,mn,mr,ms,my_zaw,my,ne,nl,no,om,or,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,so,sq,sr,su,sv,sw,ta,ta_rom,te,te_rom,th,tl,tr,ug,uk,ur,ur_rom,uz,vi,xh,yi,zh-Hans,zh-Hant' --restore-file xlmr.base/model.pt --save-dir sparse_xlmr_base.weight8.diagonal0.init.prior.one4all0.5.fp16 --tensorboard-logdir log --embed-factorize --sparse-impl hard_concrete --sparsity-weight 8 --diagonal-weight 0 --one4all '[0.5]' --lang-agnostic --fp16
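For readers unfamiliar with `--sparse-impl hard_concrete`, the following is a generic sketch of a hard concrete gate as introduced for L0 regularization by Louizos et al. (2018). It is not taken from this repository: the constants `beta`, `gamma` and `zeta` are the commonly used values from that paper, and how `--sparsity-weight` and `--diagonal-weight` enter the loss here may differ.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hard_concrete(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1, rng=random):
    """Draw one gate value in [0, 1] from a hard concrete distribution."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    stretched = s * (zeta - gamma) + gamma  # stretch to (gamma, zeta)
    return min(1.0, max(0.0, stretched))    # clip back to [0, 1]


def prob_gate_nonzero(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Probability that a gate is non-zero; summed over gates this is the expected L0 norm."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))


rng = random.Random(0)
gates = [sample_hard_concrete(log_alpha=0.0, rng=rng) for _ in range(5)]
print(gates, prob_gate_nonzero(0.0))
```

The stretch-and-clip step is what lets a gate be exactly 0 or exactly 1 with non-zero probability, while `prob_gate_nonzero` stays differentiable in `log_alpha` and can be penalized during training.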

#### Non-shared

We first tune the parameters of L0 regularization only.

    python train.py $YOUR_BINARIZED_DATA_DIR --task multilingual_masked_lm --criterion sparse_masked_lm --sample-break-mode complete --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 --total-num-update 150000 --batch-size 16 --update-freq 4 --max-update 150000 --weight-decay 0.01 --arch sparse_xlmr_base --dropout 0.1 --attention-dropout 0.1 --log-format simple --log-interval 100 --no-epoch-checkpoints --dataset-impl mmap --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --seed 0 --multilang-sampling-alpha 0.7 --monolingual-langs 'af,am,ar,as,az,be,bg,bn,bn_rom,br,bs,ca,cs,cy,da,de,el,en,eo,es,et,eu,fa,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hi_rom,hr,hu,hy,id,is,it,ja,jv,ka,kk,km,kn,ko,ku,ky,la,lo,lt,lv,mg,mk,ml,mn,mr,ms,my_zaw,my,ne,nl,no,om,or,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,so,sq,sr,su,sv,sw,ta,ta_rom,te,te_rom,th,tl,tr,ug,uk,ur,ur_rom,uz,vi,xh,yi,zh-Hans,zh-Hant' --restore-file xlmr.base/model.pt --save-dir sparse_xlmr_base.weight128.diagonal1.init.prior.one4all0.1,1.0.part.fp16 --tensorboard-logdir log --embed-factorize --sparse-impl hard_concrete --sparsity-weight 128 --diagonal-weight 1 --init-args "{'score_file':'xlmr.base/importance.pt','step':0.01}" --lang2group "{'af': 'Indo-European', 'am': 'Afro-Asiatic', 'ar': 'Afro-Asiatic', 'as': 'Indo-European', 'az': 'Turkic', 'be': 'Indo-European', 'bg': 'Indo-European', 'bn': 'Indo-European', 'bn_rom': 'Indo-European', 'br': 'Indo-European', 'bs': 'Indo-European', 'ca': 'Indo-European', 'cs': 'Indo-European', 'cy': 'Indo-European', 'da': 'Indo-European', 'de': 'Indo-European', 'el': 'Indo-European', 'en': 'Indo-European', 'eo': 'Constructed language', 'es': 'Indo-European', 'et': 'Uralic', 'eu': 'Language isolate', 'fa': 'Missing', 'fi': 'Uralic', 'fr': 'Indo-European', 'fy': 'Indo-European', 'ga': 'Indo-European', 'gd': 'Indo-European', 'gl': 'Indo-European', 'gu': 'Indo-European', 'ha': 'Afro-Asiatic', 'he': 'Afro-Asiatic', 'hi': 'Indo-European', 'hi_rom': 'Indo-European', 'hr': 'Indo-European', 'hu': 'Uralic', 'hy': 'Indo-European', 'id': 'Austronesian', 'is': 'Indo-European', 'it': 'Indo-European', 'ja': 'Japonic', 'jv': 'Austronesian', 'ka': 'Kartvelian', 'kk': 'Turkic', 'km': 'Austro-Asiatic', 'kn': 'Dravidian', 'ko': 'Koreanic', 'ku': 'Indo-European', 'ky': 'Turkic', 'la': 'Indo-European', 'lo': 'Kra-Dai', 'lt': 'Indo-European', 'lv': 'Missing', 'mg': 'Missing', 'mk': 'Indo-European', 'ml': 'Dravidian', 'mn': 'Missing', 'mr': 'Indo-European', 'ms': 'Missing', 'my_zaw': 'Sino-Tibetan', 'my': 'Sino-Tibetan', 'ne': 'Indo-European', 'nl': 'Indo-European', 'no': 'Indo-European', 'om': 'Missing', 'or': 'Indo-European', 'pa': 'Indo-European', 'pl': 'Indo-European', 'ps': 'Missing', 'pt': 'Indo-European', 'ro': 'Indo-European', 'ru': 'Indo-European', 'sa': 'Indo-European', 'sd': 'Indo-European', 'si': 'Indo-European', 'sk': 'Indo-European', 'sl': 'Indo-European', 'so': 'Afro-Asiatic', 'sq': 'Missing', 'sr': 'Indo-European', 'su': 'Austronesian', 'sv': 'Indo-European', 'sw': 'Niger-Congo', 'ta': 'Dravidian', 'ta_rom': 'Dravidian', 'te': 'Dravidian', 'te_rom': 'Dravidian', 'th': 'Kra-Dai', 'tl': 'Austronesian', 'tr': 'Turkic', 'ug': 'Turkic', 'uk': 'Indo-European', 'ur': 'Indo-European', 'ur_rom': 'Indo-European', 'uz': 'Missing', 'vi': 'Austro-Asiatic', 'xh': 'Niger-Congo', 'yi': 'Indo-European', 'zh-Hans': 'Sino-Tibetan', 'zh-Hant': 'Sino-Tibetan'}" --one4all '[0.1,1.0]' --only-update "['encoder.rank_weight.weight', 'encoder.head_weight.weight', 'encoder.hidden_weight.weight', 'encoder.rank_target.weight', 'encoder.head_target.weight', 'encoder.hidden_target.weight']" --fp16

Then we tune all parameters.

    python train.py $YOUR_BINARIZED_DATA_DIR --task multilingual_masked_lm --criterion sparse_masked_lm --sample-break-mode complete --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 --total-num-update 150000 --batch-size 16 --update-freq 4 --max-update 150000 --weight-decay 0.01 --arch sparse_xlmr_base --dropout 0.1 --attention-dropout 0.1 --log-format simple --log-interval 100 --no-epoch-checkpoints --dataset-impl mmap --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d --seed 0 --multilang-sampling-alpha 0.7 --monolingual-langs 'af,am,ar,as,az,be,bg,bn,bn_rom,br,bs,ca,cs,cy,da,de,el,en,eo,es,et,eu,fa,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hi_rom,hr,hu,hy,id,is,it,ja,jv,ka,kk,km,kn,ko,ku,ky,la,lo,lt,lv,mg,mk,ml,mn,mr,ms,my_zaw,my,ne,nl,no,om,or,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,so,sq,sr,su,sv,sw,ta,ta_rom,te,te_rom,th,tl,tr,ug,uk,ur,ur_rom,uz,vi,xh,yi,zh-Hans,zh-Hant' --restore-file sparse_xlmr_base.weight128.diagonal1.init.prior.one4all0.1,1.0.part.fp16/checkpoint_best.pt --save-dir sparse_xlmr_base.weight128.diagonal1.init.prior.one4all0.1,1.0.fp16 --tensorboard-logdir log --embed-factorize --sparse-impl hard_concrete --sparsity-weight 128 --diagonal-weight 1 --init-args "{'score_file':'xlmr.base/importance.pt','step':0.01}" --lang2group "{'af': 'Indo-European', 'am': 'Afro-Asiatic', 'ar': 'Afro-Asiatic', 'as': 'Indo-European', 'az': 'Turkic', 'be': 'Indo-European', 'bg': 'Indo-European', 'bn': 'Indo-European', 'bn_rom': 'Indo-European', 'br': 'Indo-European', 'bs': 'Indo-European', 'ca': 'Indo-European', 'cs': 'Indo-European', 'cy': 'Indo-European', 'da': 'Indo-European', 'de': 'Indo-European', 'el': 'Indo-European', 'en': 'Indo-European', 'eo': 'Constructed language', 'es': 'Indo-European', 'et': 'Uralic', 'eu': 'Language isolate', 'fa': 'Missing', 'fi': 'Uralic', 'fr': 'Indo-European', 'fy': 'Indo-European', 'ga': 'Indo-European', 'gd': 'Indo-European', 'gl': 'Indo-European', 'gu': 'Indo-European', 'ha': 'Afro-Asiatic', 'he': 'Afro-Asiatic', 'hi': 'Indo-European', 'hi_rom': 'Indo-European', 'hr': 'Indo-European', 'hu': 'Uralic', 'hy': 'Indo-European', 'id': 'Austronesian', 'is': 'Indo-European', 'it': 'Indo-European', 'ja': 'Japonic', 'jv': 'Austronesian', 'ka': 'Kartvelian', 'kk': 'Turkic', 'km': 'Austro-Asiatic', 'kn': 'Dravidian', 'ko': 'Koreanic', 'ku': 'Indo-European', 'ky': 'Turkic', 'la': 'Indo-European', 'lo': 'Kra-Dai', 'lt': 'Indo-European', 'lv': 'Missing', 'mg': 'Missing', 'mk': 'Indo-European', 'ml': 'Dravidian', 'mn': 'Missing', 'mr': 'Indo-European', 'ms': 'Missing', 'my_zaw': 'Sino-Tibetan', 'my': 'Sino-Tibetan', 'ne': 'Indo-European', 'nl': 'Indo-European', 'no': 'Indo-European', 'om': 'Missing', 'or': 'Indo-European', 'pa': 'Indo-European', 'pl': 'Indo-European', 'ps': 'Missing', 'pt': 'Indo-European', 'ro': 'Indo-European', 'ru': 'Indo-European', 'sa': 'Indo-European', 'sd': 'Indo-European', 'si': 'Indo-European', 'sk': 'Indo-European', 'sl': 'Indo-European', 'so': 'Afro-Asiatic', 'sq': 'Missing', 'sr': 'Indo-European', 'su': 'Austronesian', 'sv': 'Indo-European', 'sw': 'Niger-Congo', 'ta': 'Dravidian', 'ta_rom': 'Dravidian', 'te': 'Dravidian', 'te_rom': 'Dravidian', 'th': 'Kra-Dai', 'tl': 'Austronesian', 'tr': 'Turkic', 'ug': 'Turkic', 'uk': 'Indo-European', 'ur': 'Indo-European', 'ur_rom': 'Indo-European', 'uz': 'Missing', 'vi': 'Austro-Asiatic', 'xh': 'Niger-Congo', 'yi': 'Indo-European', 'zh-Hans': 'Sino-Tibetan', 'zh-Hant': 'Sino-Tibetan'}" --one4all '[0.1,1.0]' --fp16 --reset-dataloader --reset-lr-scheduler --reset-meters --reset-optimizer

**Note**: the above commands already apply Dynamic Sparsification. To pre-train a model with a fixed sparsity, say 50%, set `--one4all '[0.5]'` and `--sparsity-weight 8`, and replace `'step':0.01` in `--init-args` with `'sparsity':0.5`.
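The `--lang2group` value in the non-shared commands is written as a Python-style dictionary from language code to language family. To inspect which languages end up in the same group, the mapping can be inverted as sketched below. The excerpt contains only five of the entries, and parsing with `ast.literal_eval` is our choice for this illustration; how `train.py` parses the argument is not shown here.

```python
import ast
from collections import defaultdict

# Short excerpt of the --lang2group value used above (not the full mapping).
lang2group_arg = (
    "{'en': 'Indo-European', 'fi': 'Uralic', 'hu': 'Uralic', "
    "'fa': 'Missing', 'tr': 'Turkic'}"
)

# Safely turn the dictionary literal into a Python dict.
lang2group = ast.literal_eval(lang2group_arg)

# Invert language -> group into group -> list of languages.
group2langs = defaultdict(list)
for lang, group in lang2group.items():
    group2langs[group].append(lang)

for group, langs in sorted(group2langs.items()):
    print(f"{group}: {', '.join(langs)}")
```

Note that every language labelled `'Missing'` lands in one common group under this reading.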

## Fine-tuning

The fine-tuning code is based on [XTREME](https://github.com/google-research/xtreme). After entering the XTREME code folder, copy the files in the `fine-tune` folder into `third_party/transformers/src/transformers`, replacing the existing files of the same name.
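The copy step can be scripted as in the sketch below. The function name and its two path arguments are placeholders for illustration; it assumes the `fine-tune` folder mirrors the layout of the `transformers` package it is copied into.

```python
import shutil
from pathlib import Path


def overlay_fine_tune_files(fine_tune_dir, xtreme_dir):
    """Copy every file under fine_tune_dir into XTREME's transformers package,
    overwriting files that already exist there and leaving the rest untouched."""
    target = Path(xtreme_dir) / "third_party" / "transformers" / "src" / "transformers"
    if not target.is_dir():
        raise FileNotFoundError(f"{target} not found; is XTREME set up?")
    shutil.copytree(fine_tune_dir, target, dirs_exist_ok=True)  # needs Python 3.8+
    return target
```

For example, `overlay_fine_tune_files("fine-tune", "path/to/xtreme")` would perform the copy, where both paths depend on where the two repositories were cloned.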

Before fine-tuning, please use the script `scripts/convert_sparse_fairseq_model_to_transformers.py` to convert the pre-trained models.

To reproduce the fine-tuning results, follow the XTREME instructions for running experiments.