Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

Nikolay Bogoychev, Kenneth Heafield, Alham Fikri Aji, Marcin Junczys-Dowmunt


Abstract
In order to extract the best possible performance from asynchronous stochastic gradient descent (SGD), one must increase the mini-batch size and scale the learning rate accordingly. To achieve further speedup, we introduce a technique that delays gradient updates, effectively increasing the mini-batch size. Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous SGD, which harms model convergence. We introduce local optimizers, which mitigate the stale gradient problem and, together with fine-tuning our momentum, allow us to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.
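The abstract's core idea is that a worker can accumulate several mini-batch gradients and apply a local optimizer step before communicating, which multiplies the effective mini-batch size while keeping the optimizer state on the worker. The sketch below is a minimal NumPy illustration of delayed gradient updates with a local momentum optimizer; the toy linear model, the function names, and the hyperparameters are illustrative assumptions, not the authors' Marian implementation.

```python
import numpy as np


def toy_gradient(w, X, y):
    """Gradient of mean squared error for a toy linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def train_worker(w_global, data_batches, lr=0.1, momentum=0.9, accumulate_steps=4):
    """One asynchronous worker: accumulate gradients, apply a *local*
    momentum optimizer, and return the total delta to push to the shared
    parameters. All names and values here are hypothetical."""
    w_local = w_global.copy()          # worker's (possibly stale) parameter copy
    velocity = np.zeros_like(w_local)  # local optimizer state never leaves the worker
    accumulated = np.zeros_like(w_local)
    steps = 0

    for X, y in data_batches:
        accumulated += toy_gradient(w_local, X, y)
        steps += 1
        if steps % accumulate_steps == 0:
            # Delayed update: one optimizer step per `accumulate_steps` batches,
            # i.e. an effective mini-batch that is `accumulate_steps` times larger.
            velocity = momentum * velocity + accumulated / accumulate_steps
            w_local -= lr * velocity
            accumulated[:] = 0.0

    # In asynchronous SGD this delta would be pushed to a parameter server;
    # it is stale by however much the server moved in the meantime.
    return w_local - w_global


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0])
    batches = []
    for _ in range(32):
        X = rng.normal(size=(16, 2))
        batches.append((X, X @ w_true + 0.01 * rng.normal(size=16)))

    w = np.zeros(2)
    w += train_worker(w, batches)
    print("estimated weights:", w)
```

In this sketch only the accumulated parameter delta would cross the network, so communication happens once per `accumulate_steps` batches rather than once per batch, which is the speedup mechanism the abstract describes.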
Anthology ID:
D18-1332
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2991–2996
URL:
https://aclanthology.org/D18-1332
DOI:
10.18653/v1/D18-1332
Cite (ACL):
Nikolay Bogoychev, Kenneth Heafield, Alham Fikri Aji, and Marcin Junczys-Dowmunt. 2018. Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2991–2996, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation (Bogoychev et al., EMNLP 2018)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/D18-1332.pdf
Data
WMT 2016