# Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach
ARR submission

## Introduction
Our implementation is based on [fairseq v0.12.2](https://github.com/facebookresearch/fairseq/tree/v0.12.2).

Our main implementation is in
`./fairseq/fairseq/criterions/label_smoothed_cross_entropy.py`,
`./fairseq/examples/speech_to_text/prep_mustc_data.py`, and
`./fairseq/fairseq_cli/generate.py`.

## Data and Preprocessing
[Download](https://ict.fbk.eu/must-c) and unpack the MuST-C data to
`${MUSTC_ROOT}/en-${TARGET_LANG_ID}`, then preprocess it with:
```bash
# Generate TSV manifests, features, vocabulary
# and configuration for each language
python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task asr \
  --vocab-type unigram --vocab-size 5000
python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task st \
  --vocab-type unigram --vocab-size 8000
```

## ASR
#### Training
En-De (fairseq also provides a [pretrained ASR model](https://dl.fbaipublicfiles.com/fairseq/s2t/mustc_de_asr_transformer_s.pt)):
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
  --save-dir ${ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --report-accuracy \
  --arch s2t_transformer_s --optimizer adam --lr 1e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
```
#### Inference & Evaluation
```bash
ASR_CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
  --inputs ${ASR_SAVE_DIR} --num-epoch-checkpoints 10 \
  --output "${ASR_SAVE_DIR}/${ASR_CHECKPOINT_FILENAME}"
fairseq-generate ${MUSTC_ROOT}/en-de \
  --config-yaml config_asr.yaml --gen-subset tst-COMMON_asr --task speech_to_text \
  --path ${ASR_SAVE_DIR}/${ASR_CHECKPOINT_FILENAME} --max-tokens 50000 --beam 5 \
  --scoring wer --wer-tokenizer 13a --wer-lowercase --wer-remove-punct
```
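
For readers unfamiliar with the metric, the sketch below shows roughly what the `--scoring wer --wer-lowercase --wer-remove-punct` options measure: word-level edit distance divided by the number of reference words, after lowercasing and stripping punctuation. It is a minimal illustration, not fairseq's scorer; the function names and the simple whitespace tokenization are assumptions made for this example.

```python
import string


def normalize(text):
    # Assumption: lowercase, strip ASCII punctuation, split on whitespace.
    # fairseq's own WER scorer uses its configured tokenizer instead.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_edit_distance(ref, hyp):
    # Levenshtein distance over word lists, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    # Corpus-level WER in percent: total word errors / total reference words.
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref)
        errors += word_edit_distance(ref_words, normalize(hyp))
        words += len(ref_words)
    return 100.0 * errors / words
```

For example, `wer(["a b c d"], ["a x c"])` counts one substitution and one deletion against four reference words.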


## ST
#### Training
En-De:
For ST model:
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
  --save-dir ${ST_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/${ASR_CHECKPOINT_FILENAME} \
  --skip-invalid-size-inputs-valid-test
```

#### Inference & Evaluation
Average the last 10 checkpoints and use the averaged checkpoint for evaluation.
For ST model:
```bash
ST_CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
  --inputs ${ST_SAVE_DIR} --num-epoch-checkpoints 10 \
  --output "${ST_SAVE_DIR}/${ST_CHECKPOINT_FILENAME}"
fairseq-generate ${MUSTC_ROOT}/en-de \
  --config-yaml config_st.yaml --gen-subset tst_st --task speech_to_text --skip-invalid-size-inputs-valid-test \
  --path ${ST_SAVE_DIR}/${ST_CHECKPOINT_FILENAME} \
  --max-tokens 50000 --beam 5 --scoring sacrebleu
```
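
Checkpoint averaging takes the element-wise mean of each parameter across the selected checkpoints. The sketch below illustrates the idea on toy data, assuming each checkpoint is a plain dict mapping a parameter name to a flat list of floats. The function name and data layout are hypothetical; `scripts/average_checkpoints.py` operates on real PyTorch checkpoints.

```python
def average_state_dicts(state_dicts):
    # Assumption: every checkpoint has the same keys and shapes,
    # and parameters are stored as flat lists of floats.
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    count = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


# Toy usage with two "checkpoints" that hold one parameter each.
toy_checkpoints = [
    {"encoder.weight": [1.0, 2.0]},
    {"encoder.weight": [3.0, 4.0]},
]
toy_average = average_state_dicts(toy_checkpoints)
```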

## ST Adapted
#### Training
For the adapted ST model:
(Note that we pass the `--ignore-prefix-size 1` flag, which indicates that the example sentences are now part of the inputs and should be treated as prefixes during training. The function `get_lprobs_and_target()` in `./fairseq/fairseq/criterions/label_smoothed_cross_entropy.py` computes the exact prefix size (example size).)
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_st.yaml --train-subset train_ex_st --valid-subset dev_st \
  --save-dir ${ADAPTED_ST_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --arch s2t_transformer_s --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --ignore-prefix-size 1 \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8 --dropout 0.2 \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/${ASR_CHECKPOINT_FILENAME} \
  --skip-invalid-size-inputs-valid-test --finetune-from-model ${ST_SAVE_DIR}/${ST_CHECKPOINT_FILENAME}
```
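
To illustrate what excluding a variable-length prefix from the loss means, here is a minimal sketch in plain Python. It assumes, purely for illustration, that the example translation is followed by a separator token in the target sequence; the separator id and both function names are hypothetical, and the actual logic in `get_lprobs_and_target()` may determine the prefix size differently.

```python
import math

SEP_ID = 3  # hypothetical id of a token that ends the example prefix


def prefix_length(target, sep_id=SEP_ID):
    # Number of leading tokens up to and including the first separator.
    # Returns 0 when the target carries no prefix.
    if sep_id in target:
        return target.index(sep_id) + 1
    return 0


def masked_nll(lprobs, target, sep_id=SEP_ID):
    # lprobs[t][v] is the log-probability of vocabulary item v at step t.
    # Only positions after the prefix contribute to the loss.
    start = prefix_length(target, sep_id)
    losses = [-lprobs[t][target[t]] for t in range(start, len(target))]
    if not losses:
        return 0.0
    return sum(losses) / len(losses)


# Toy usage: vocabulary of 8 items with uniform predictions at 5 steps.
uniform = [[math.log(1.0 / 8)] * 8 for _ in range(5)]
toy_target = [7, 6, SEP_ID, 1, 2]  # two example tokens, separator, two real tokens
toy_loss = masked_nll(uniform, toy_target)
```

Here only the last two positions are scored, so the prefix tokens never produce a gradient signal of their own.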
#### Inference & Evaluation
For the adapted ST model:
(Note that we pass the `--prefix-size 1` flag, which indicates that the example sentences are now part of the inputs and should be treated as prefixes during inference.)
```bash
ADAPTED_ST_CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
  --inputs ${ADAPTED_ST_SAVE_DIR} --num-epoch-checkpoints 10 \
  --output "${ADAPTED_ST_SAVE_DIR}/${ADAPTED_ST_CHECKPOINT_FILENAME}"
fairseq-generate ${MUSTC_ROOT}/en-de \
  --config-yaml config_st.yaml --gen-subset tst_ex_st --task speech_to_text --skip-invalid-size-inputs-valid-test  \
  --path ${ADAPTED_ST_SAVE_DIR}/${ADAPTED_ST_CHECKPOINT_FILENAME} --batch-size 1 --prefix-size 1 \
  --max-tokens 50000 --beam 5 --scoring sacrebleu --max-source-positions 30000
```
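
Prefix-constrained decoding means the decoder is forced to start with given tokens and only generates freely afterwards. The sketch below shows the idea with greedy search for brevity (the command above uses beam search). `greedy_decode_with_prefix` and the toy `step_fn` are hypothetical names introduced for illustration and are not part of fairseq.

```python
def greedy_decode_with_prefix(step_fn, prefix, eos_id, max_len=20):
    # step_fn(tokens) returns one score per vocabulary item for the next token.
    tokens = list(prefix)  # forced tokens are copied, not predicted
    while len(tokens) < max_len:
        scores = step_fn(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens


def toy_step_fn(tokens, vocab_size=6):
    # Toy model: always prefers the id that follows the last token.
    last = tokens[-1] if tokens else -1
    scores = [0.0] * vocab_size
    scores[(last + 1) % vocab_size] = 1.0
    return scores


# With the prefix [2] the output continues from 2; without it, from 0.
with_prefix = greedy_decode_with_prefix(toy_step_fn, prefix=[2], eos_id=5)
without_prefix = greedy_decode_with_prefix(toy_step_fn, prefix=[], eos_id=5)
```

The forced prefix changes what the model conditions on, which is how a prepended example sentence can influence the rest of the translation.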