Abstract
Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster at raw training speed since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models for several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD under-performs, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.
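As a rough illustration of the idea in the abstract, the toy sketch below contrasts plain asynchronous SGD, which applies each stale gradient as soon as a worker delivers it, with a variant that sums several asynchronous gradients and applies them as a single update. This is not the authors' code: the quadratic objective, the round-robin worker model, and all names are hypothetical, chosen only to keep the example self-contained and runnable.

```python
# Minimal sketch (not the paper's implementation): plain asynchronous SGD
# applies every stale gradient immediately; the summed variant accumulates
# N asynchronous gradients and applies them as one combined update.
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.array([3.0, -2.0])  # minimiser of the toy loss


def gradient(theta):
    """Noisy gradient of the toy loss 0.5 * ||theta - TARGET||^2."""
    return (theta - TARGET) + rng.normal(scale=0.1, size=theta.shape)


def run(num_workers=8, steps=400, lr=0.1, accumulate=1):
    """Simulate asynchronous SGD with optional summing of updates.

    accumulate=1 mimics plain asynchronous SGD (apply every gradient
    immediately); accumulate=N sums N asynchronous gradients first.
    """
    theta = np.zeros(2)                                   # shared parameters
    copies = [theta.copy() for _ in range(num_workers)]   # stale worker copies
    buffer, buffered = np.zeros_like(theta), 0

    for step in range(steps):
        w = step % num_workers          # worker that finishes at this tick
        g = gradient(copies[w])         # gradient w.r.t. its stale parameters
        buffer += g
        buffered += 1
        if buffered == accumulate:      # apply the summed update
            theta -= lr * buffer
            buffer[:] = 0.0
            buffered = 0
        copies[w] = theta.copy()        # worker pulls fresh parameters

    return np.linalg.norm(theta - TARGET)


print("plain async  final error:", run(accumulate=1))
print("summed (N=8) final error:", run(accumulate=8))
```

The sketch only demonstrates the mechanism; the paper's results concern Transformer training, where applying each stale update immediately harms convergence in ways this toy objective does not capture.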
- Anthology ID: D19-5608
- Volume: Proceedings of the 3rd Workshop on Neural Generation and Translation
- Month: November
- Year: 2019
- Address: Hong Kong
- Editors: Alexandra Birch, Andrew Finch, Hiroaki Hayashi, Ioannis Konstas, Thang Luong, Graham Neubig, Yusuke Oda, Katsuhito Sudoh
- Venue: NGT
- Publisher: Association for Computational Linguistics
- Pages: 80–89
- URL: https://aclanthology.org/D19-5608
- DOI: 10.18653/v1/D19-5608
- Cite (ACL): Alham Fikri Aji and Kenneth Heafield. 2019. Making Asynchronous Stochastic Gradient Descent Work for Transformers. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 80–89, Hong Kong. Association for Computational Linguistics.
- Cite (Informal): Making Asynchronous Stochastic Gradient Descent Work for Transformers (Aji & Heafield, NGT 2019)
- PDF: https://aclanthology.org/D19-5608.pdf