@inproceedings{aji-heafield-2019-making,
    title = "Making Asynchronous Stochastic Gradient Descent Work for Transformers",
    author = "Aji, Alham Fikri  and
      Heafield, Kenneth",
    editor = "Birch, Alexandra  and
      Finch, Andrew  and
      Hayashi, Hiroaki  and
      Konstas, Ioannis  and
      Luong, Thang  and
      Neubig, Graham  and
      Oda, Yusuke  and
      Sudoh, Katsuhito",
    booktitle = "Proceedings of the 3rd Workshop on Neural Generation and Translation",
    month = nov,
    year = "2019",
    address = "Hong Kong",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/iwcs-25-ingestion/D19-5608/",
    doi = "10.18653/v1/D19-5608",
    pages = "80--89",
    abstract = "Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster at raw training speed since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models for several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast."
}
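For orientation, below is a minimal, self-contained NumPy sketch of the mechanism the abstract describes: summing several asynchronous (possibly stale) gradients before applying them, instead of applying each one immediately. It is a toy simulation with illustrative names and synthetic data, not the authors' Transformer/NMT implementation, and it does not reproduce the paper's results.

```python
"""Toy sketch of the accumulated asynchronous update idea from the abstract:
rather than applying each (possibly stale) gradient as soon as a worker sends
it, the optimizer sums several of them and applies one combined update.
Illustrative only; not the paper's implementation."""
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data standing in for a real training task.
X = rng.normal(size=(1024, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.01 * rng.normal(size=1024)


def minibatch_grad(w, idx):
    """Mini-batch gradient of 0.5 * mean squared error at parameters w."""
    xb, yb = X[idx], y[idx]
    return xb.T @ (xb @ w - yb) / len(idx)


def train(accum_steps, lr=0.05, steps=2000, staleness=8, batch=32):
    """Simulate asynchronous SGD with optional gradient accumulation.

    accum_steps=1 applies every stale gradient immediately (plain
    asynchronous SGD); accum_steps>1 sums several stale gradients and
    applies them as one combined update, as the abstract describes.
    """
    w = np.zeros(8)
    snapshots = [w.copy() for _ in range(staleness + 1)]  # stale copies
    buffered = np.zeros_like(w)
    for t in range(steps):
        # A worker computes its gradient on a stale parameter snapshot.
        stale_w = snapshots[0]
        idx = rng.integers(0, len(X), size=batch)
        buffered += minibatch_grad(stale_w, idx)
        if (t + 1) % accum_steps == 0:
            w -= lr * buffered            # one summed (not averaged) update
            buffered = np.zeros_like(w)
        snapshots = snapshots[1:] + [w.copy()]  # advance the staleness queue
    return float(np.mean((X @ w - y) ** 2))


print("plain asynchronous SGD :", train(accum_steps=1))
print("accumulate 4 then apply:", train(accum_steps=4))
```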