TY - GEN
T1 - Reducing global reductions in large-scale distributed training
AU - Cong, Guojing
AU - Yang, Chih Chieh
AU - Zhou, Fan
N1 - Publisher Copyright:
© 2019 ACM.
PY - 2019/8/5
Y1 - 2019/8/5
AB - Current large-scale training of deep neural networks typically employs synchronous stochastic gradient descent, which incurs large communication overhead. Instead of optimizing reduction routines as done in recent studies, we propose algorithms that do not require frequent global reductions. We first show that reducing the global reduction frequency works as an effective regularization technique that improves the generalization of adaptive optimizers. We then propose an algorithm that reduces the global reduction frequency by employing local reductions on a subset of learners. In addition, to maximize the effect of reduction on convergence, we introduce reduction momentum that further accelerates convergence. Our experiment with the CIFAR-10 dataset shows that, for the K-step averaging algorithm, extremely sparse reductions help bridge the generalization gap. With 6 GPUs, in comparison to regular synchronous implementations, our implementation reduces the number of global reductions by more than 99%. Further, we show that with 32 GPUs, our implementation reduces the number of global reductions by half. Experimenting with the ImageNet-1K dataset, we show that combining local reductions with global reductions and applying reduction momentum can further reduce global reductions by up to 62% while achieving the same validation accuracy as K-step averaging. With 400 GPUs, the global reduction frequency is reduced to once per 102K samples.
UR - http://www.scopus.com/inward/record.url?scp=85117541505&partnerID=8YFLogxK
DO - 10.1145/3339186.3339203
M3 - Conference contribution
AN - SCOPUS:85117541505
T3 - ACM International Conference Proceeding Series
BT - 48th International Conference on Parallel Processing, ICPP 2019 - Workshop Proceedings
PB - Association for Computing Machinery
T2 - 48th International Conference on Parallel Processing, ICPP 2019
Y2 - 5 August 2019 through 8 August 2019
ER -