@inproceedings{aji-heafield-2020-compressing,
    title = "Compressing Neural Machine Translation Models with 4-bit Precision",
    author = "Aji, Alham Fikri  and
      Heafield, Kenneth",
    editor = "Birch, Alexandra  and
      Finch, Andrew  and
      Hayashi, Hiroaki  and
      Heafield, Kenneth  and
      Junczys-Dowmunt, Marcin  and
      Konstas, Ioannis  and
      Li, Xian  and
      Neubig, Graham  and
      Oda, Yusuke",
    booktitle = "Proceedings of the Fourth Workshop on Neural Generation and Translation",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2020.ngt-1.4/",
    doi = "10.18653/v1/2020.ngt-1.4",
    pages = "35--42",
    abstract = "Neural Machine Translation (NMT) is resource-intensive. We design a quantization procedure to compress fit NMT models better for devices with limited hardware capability. We use logarithmic quantization, instead of the more commonly used fixed-point quantization, based on the empirical fact that parameters distribution is not uniform. We find that biases do not take a lot of memory and show that biases can be left uncompressed to improve the overall quality without affecting the compression rate. We also propose to use an error-feedback mechanism during retraining, to preserve the compressed model as a stale gradient. We empirically show that NMT models based on Transformer or RNN architecture can be compressed up to 4-bit precision without any noticeable quality degradation. Models can be compressed up to binary precision, albeit with lower quality. RNN architecture seems to be more robust towards compression, compared to the Transformer."
}

Markdown (Informal)
[Compressing Neural Machine Translation Models with 4-bit Precision](https://aclanthology.org/2020.ngt-1.4/) (Aji & Heafield, NGT 2020)
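As a rough illustration of the logarithmic quantization the abstract describes, here is a minimal NumPy sketch that rounds each weight's magnitude to the nearest power of two within a limited exponent range and keeps its sign. It is only a sketch under assumed conventions: the paper's exact codebook, scaling, bias handling, and error-feedback retraining are not reproduced here, and the function name `log_quantize` is a placeholder, not an identifier from the authors' code.

```python
import numpy as np

def log_quantize(weights, bits=4):
    """Sketch of logarithmic quantization: snap magnitudes to powers of two.

    Assumes one bit for the sign and the remaining bits index exponent levels;
    the actual scheme in Aji & Heafield (2020) may differ in its codebook and scaling.
    """
    sign = np.sign(weights)
    magnitude = np.abs(weights)

    # Avoid log2(0); exact zeros stay zero after quantization.
    nonzero = magnitude > 0
    exponent = np.zeros_like(magnitude)
    exponent[nonzero] = np.round(np.log2(magnitude[nonzero]))

    # With `bits` bits, keep 2**(bits-1) exponent levels below the largest exponent.
    levels = 2 ** (bits - 1)
    if nonzero.any():
        max_exp = exponent[nonzero].max()
        exponent[nonzero] = np.clip(exponent[nonzero], max_exp - levels + 1, max_exp)

    return np.where(nonzero, sign * 2.0 ** exponent, 0.0)

# Example: quantize a random weight matrix to 4-bit logarithmic precision.
w = np.random.randn(4, 4).astype(np.float32)
print(log_quantize(w, bits=4))
```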