Efficient Machine Translation with Model Pruning and Quantization
Maximiliana Behnke, Nikolay Bogoychev, Alham Fikri Aji, Kenneth Heafield, Graeme Nail, Qianqian Zhu, Svetlana Tchistiakova, Jelmer van der Linde, Pinzhen Chen, Sidharth Kashyap, Roman Grundkiewicz
Proceedings of the Sixth Conference on Machine Translation, 2021
We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware, under both throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU tracks, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and with 8-bit integers in tensor cores. Some of our submissions optimize for size by using 4-bit log quantization and omitting the lexical shortlist. We extended pruning to more parts of the network, emphasizing component- and block-level pruning, which, unlike coefficient-wise pruning, actually improves speed.
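As an illustrative aside, the SSRU decoder replaces the Transformer decoder's self-attention with a lightweight recurrence. Below is a minimal NumPy sketch of a single decoder step, following the SSRU formulation of Kim et al. (2019), which drops the SRU's reset gate and replaces tanh with ReLU; the function name and per-step shapes are assumptions for illustration, not the Marian implementation used in these submissions.

```python
import numpy as np

def ssru_step(x_t, c_prev, W, W_f, b_f):
    # One SSRU decoder step (Kim et al., 2019): an SRU with the reset
    # gate removed and tanh replaced by ReLU.
    # Shapes: x_t, c_prev, b_f are (d,); W, W_f are (d, d).
    f_t = 1.0 / (1.0 + np.exp(-(W_f @ x_t + b_f)))  # forget gate
    c_t = f_t * c_prev + (1.0 - f_t) * (W @ x_t)    # gated cell update
    return np.maximum(c_t, 0.0), c_t                # ReLU output, new cell state
```

The 4-bit log quantization used in the size-oriented submissions can likewise be sketched: each weight is snapped to a signed power of two times a scale, so a 4-bit code splits into one sign bit and three exponent bits. The per-tensor maximum scale and the helper name log_quantize below are simplifying assumptions, not the submissions' actual quantizer.

```python
import numpy as np

def log_quantize(w, bits=4):
    # Map each weight to sign * scale * 2^(-k) for an integer exponent
    # k in [0, 2^(bits-1) - 1]; with bits=4 that is 1 sign bit plus
    # 3 exponent bits. Exact zeros stay zero via np.sign.
    sign = np.sign(w)
    scale = np.abs(w).max()  # per-tensor scale: a simplifying assumption
    with np.errstate(divide="ignore"):
        k = np.round(-np.log2(np.abs(w) / scale))
    k = np.clip(k, 0, 2 ** (bits - 1) - 1)
    return sign * scale * 2.0 ** (-k)

# Example: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
print("mean abs error:", np.abs(w - log_quantize(w)).mean())
```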