On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization

Fan Zhou, Guojing Cong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

114 Scopus citations

Abstract

We adopt and analyze a synchronous K-step averaging stochastic gradient descent algorithm, which we call K-AVG, for solving large-scale machine learning problems. We establish convergence results for K-AVG with nonconvex objectives. Our analysis of K-AVG applies to many existing variants of synchronous SGD. We explain why the K-step delay is necessary and leads to better performance than traditional parallel stochastic gradient descent, which is equivalent to K-AVG with K = 1. We also show that K-AVG scales better with the number of learners than asynchronous stochastic gradient descent (ASGD). Another advantage of K-AVG over ASGD is that it allows larger stepsizes and facilitates faster convergence. On a cluster of 128 GPUs, K-AVG is faster than ASGD implementations and achieves better accuracy and faster convergence when training on the CIFAR-10 dataset.
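The scheme described in the abstract can be illustrated with a minimal sketch: each of several learners takes K local SGD steps from the same starting point, and the local models are then averaged synchronously to form the next iterate. The sketch below assumes a toy least-squares objective in plain NumPy; the names (num_learners, k_steps, step_size) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize f(w) = 0.5 * ||X w - y||^2 / n
n, d = 1024, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)


def stochastic_grad(w, batch_idx):
    """Mini-batch gradient of the least-squares loss."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return Xb.T @ (Xb @ w - yb) / len(batch_idx)


def k_avg(num_learners=8, k_steps=4, step_size=0.05, rounds=200, batch=32):
    """K-step averaging SGD sketch: each learner runs k_steps local SGD
    updates, then all local models are averaged (synchronously) to form
    the next global iterate."""
    w = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for _ in range(num_learners):
            w_local = w.copy()
            for _ in range(k_steps):
                idx = rng.integers(0, n, size=batch)
                w_local -= step_size * stochastic_grad(w_local, idx)
            local_models.append(w_local)
        w = np.mean(local_models, axis=0)  # synchronization / averaging step
    return w


w_est = k_avg()
print("parameter error:", np.linalg.norm(w_est - w_true))
```

With k_steps = 1, every local step is followed by averaging, which corresponds to the traditional synchronous parallel SGD baseline the abstract compares against.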

Original language: English
Title of host publication: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Editors: Jerome Lang
Publisher: International Joint Conferences on Artificial Intelligence
Pages: 3219-3227
Number of pages: 9
ISBN (Electronic): 9780999241127
DOIs
State: Published - 2018
Externally published: Yes
Event: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018 - Stockholm, Sweden
Duration: Jul 13 2018 - Jul 19 2018

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
Volume: 2018-July
ISSN (Print): 1045-0823

Conference

Conference: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Country/Territory: Sweden
City: Stockholm
Period: 07/13/18 - 07/19/18

Funding

Supported in part by NSF Grants DMS-1509739 and CCF-1523768.
