Reducing global reductions in large-scale distributed training

Guojing Cong, Chih Chieh Yang, Fan Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Current large-scale training of deep neural networks typically employs synchronous stochastic gradient descent, which incurs large communication overhead. Instead of optimizing the reduction routines themselves, as done in recent studies, we propose algorithms that do not require frequent global reductions. We first show that reducing the global reduction frequency acts as an effective regularization technique that improves the generalization of adaptive optimizers. We then propose an algorithm that lowers the global reduction frequency by employing local reductions over a subset of learners. In addition, to maximize the effect of each reduction on convergence, we introduce reduction momentum, which further accelerates convergence. Our experiments with the CIFAR-10 dataset show that, for the K-step averaging algorithm, extremely sparse reductions help bridge the generalization gap. With 6 GPUs, in comparison to regular synchronous implementations, our implementation eliminates more than 99% of global reductions. Further, we show that with 32 GPUs our implementation halves the number of global reductions. Experimenting with the ImageNet-1K dataset, we show that combining local reductions with global reductions and applying reduction momentum reduces global reductions by up to a further 62% while achieving the same validation accuracy as K-step averaging. With 400 GPUs, the global reduction frequency drops to once per 102K samples.
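The core idea of K-step averaging described above can be sketched with a small simulation: each learner takes local SGD steps on its own objective, and a global reduction (parameter averaging, i.e. an all-reduce) happens only once every K steps instead of every step. This is a minimal illustrative sketch, not the paper's implementation; the quadratic per-learner losses, the function name, and all parameters are assumptions chosen so the example is self-contained and runnable without MPI.

```python
import numpy as np

def k_step_averaging(num_learners, K, steps, lr=0.1, dim=4, seed=0):
    """Simulate K-step averaging (hypothetical toy setup, not the paper's code).

    Each learner i runs local SGD on its own quadratic loss
    f_i(w) = ||w - t_i||^2, and a global reduction (averaging the
    replicated parameters, standing in for an all-reduce) is performed
    only once every K local steps.
    """
    rng = np.random.default_rng(seed)
    targets = rng.normal(size=(num_learners, dim))  # per-learner optima t_i
    w = np.zeros((num_learners, dim))               # replicated parameters
    reductions = 0
    for step in range(1, steps + 1):
        grads = 2.0 * (w - targets)                 # local gradients
        w -= lr * grads                             # local SGD step
        if step % K == 0:                           # reduce every K steps only
            w[:] = w.mean(axis=0)                   # "all-reduce": average
            reductions += 1
    return w.mean(axis=0), reductions
```

A synchronous implementation would perform `steps` reductions; here only `steps // K` occur, so larger K trades communication for more drift between the learners' replicas, which is the regularization effect the abstract refers to.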

Original language: English
Title of host publication: 48th International Conference on Parallel Processing, ICPP 2019 - Workshop Proceedings
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450371964
DOIs
State: Published - Aug 5 2019
Event: 48th International Conference on Parallel Processing, ICPP 2019 - Kyoto, Japan
Duration: Aug 5 2019 – Aug 8 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 48th International Conference on Parallel Processing, ICPP 2019
Country/Territory: Japan
City: Kyoto
Period: 08/5/19 – 08/8/19
