@inproceedings{behnke-heafield-2021-pruning,
    title = "Pruning Neural Machine Translation for Speed Using Group Lasso",
    author = "Behnke, Maximiliana  and
      Heafield, Kenneth",
    editor = "Barrault, Lo{\"i}c  and
      Bojar, Ond{\v{r}}ej  and
      Bougares, Fethi  and
      Chatterjee, Rajen  and
      Costa-juss{\`a}, Marta R.  and
      Federmann, Christian  and
      Fishel, Mark  and
      Fraser, Alexander  and
      Freitag, Markus  and
      Graham, Yvette  and
      Grundkiewicz, Roman  and
      Guzm{\'a}n, Paco  and
      Haddow, Barry  and
      Huck, Matthias  and
      Yepes, Antonio Jimeno  and
      Koehn, Philipp  and
      Kocmi, Tom{\'a}{\v{s}}  and
      Martins, Andr{\'e}  and
      Morishita, Makoto  and
      Monz, Christof",
    booktitle = "Proceedings of the Sixth Conference on Machine Translation",
    month = nov,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2021.wmt-1.116/",
    pages = "1074--1086",
    abstract = "Unlike most work on pruning neural networks, we make inference faster. Group lasso regularisation enables pruning entire rows, columns or blocks of parameters that result in a smaller dense network. Because the network is still dense, efficient matrix multiply routines are still used and only minimal software changes are required to support variable layer sizes. Moreover, pruning is applied during training so there is no separate pruning step. Experiments on top of English-{\ensuremath{>}}German models, which already have state-of-the-art speed and size, show that two-thirds of feedforward connections can be removed with 0.2 BLEU loss. With 6 decoder layers, the pruned model is 34{\%} faster; with 2 tied decoder layers, the pruned model is 14{\%} faster. Pruning entire heads and feedforward connections in a 12{--}1 encoder-decoder architecture gains an additional 51{\%} speed-up. These results push the Pareto frontier with respect to the trade-off between time and quality compared to strong baselines. In the WMT 2021 Efficiency Task, our pruned and quantised models are 1.9{--}2.7x faster at the cost of 0.9{--}1.7 BLEU in comparison to the unoptimised baselines. Across language pairs, we see similar sparsity patterns: an ascending or U-shaped distribution in encoder feedforward and attention layers and an ascending distribution in the decoder."
}