Elastic distributed training with fast convergence and efficient resource utilization

Guojing Cong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Distributed learning is now routinely conducted on the cloud as well as on dedicated clusters. Training with elastic resources brings new challenges and design choices. Prior studies focus on runtime performance and assume static algorithmic behavior. In this work, by analyzing the impact of resource scaling on convergence, we introduce schedules for synchronous stochastic gradient descent that proactively adapt the number of learners to reduce training time and improve convergence. Our approach no longer assumes a constant number of processors throughout training. In our experiments, distributed stochastic gradient descent with dynamic schedules and reduction momentum achieves better convergence and significant speedups over prior static approaches. Numerous distributed training jobs running on the cloud may benefit from our approach.
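To illustrate the idea of an elastic learner schedule, the following is a minimal Python sketch (not the authors' implementation) of synchronous data-parallel SGD on a toy least-squares problem. The `learner_schedule` function and the momentum tempering rule are hypothetical assumptions chosen for illustration; the paper's actual schedules and momentum adjustment may differ.

```python
# Toy sketch of synchronous data-parallel SGD with an elastic number of
# learners. The schedule and momentum adjustment below are illustrative
# assumptions, not the method from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression task: minimize mean squared error of X w vs. y.
n, d = 4096, 32
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad(w, idx):
    """Mini-batch gradient of the squared loss on rows `idx`."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def learner_schedule(epoch):
    """Hypothetical elastic schedule: start with few learners, add more later."""
    return 2 if epoch < 5 else (4 if epoch < 10 else 8)

w = np.zeros(d)
velocity = np.zeros(d)
lr, base_momentum, batch_per_learner = 0.05, 0.9, 64

for epoch in range(15):
    workers = learner_schedule(epoch)
    # One illustrative way to temper momentum as the effective batch grows.
    momentum = base_momentum ** (workers / 2)
    for _ in range(n // (workers * batch_per_learner)):
        # Each "learner" computes a gradient on its own mini-batch;
        # synchronous SGD averages them before a single shared update.
        grads = [grad(w, rng.integers(0, n, batch_per_learner))
                 for _ in range(workers)]
        g = np.mean(grads, axis=0)
        velocity = momentum * velocity + g
        w -= lr * velocity
    loss = np.mean((X @ w - y) ** 2)
    print(f"epoch {epoch:2d}  learners {workers}  loss {loss:.5f}")
```

In this toy setting, adding learners mid-training enlarges the effective batch per update, which is why the sketch also adjusts the momentum term; the paper studies how such resource scaling interacts with convergence.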

Original language: English
Title of host publication: Proceedings - 20th IEEE International Conference on Machine Learning and Applications, ICMLA 2021
Editors: M. Arif Wani, Ishwar K. Sethi, Weisong Shi, Guangzhi Qu, Daniela Stan Raicu, Ruoming Jin
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 972-979
Number of pages: 8
ISBN (Electronic): 9781665443371
DOIs
State: Published - 2021
Event: 20th IEEE International Conference on Machine Learning and Applications, ICMLA 2021 - Virtual, Online, United States
Duration: Dec 13, 2021 - Dec 16, 2021

Publication series

Name: Proceedings - 20th IEEE International Conference on Machine Learning and Applications, ICMLA 2021

Conference

Conference: 20th IEEE International Conference on Machine Learning and Applications, ICMLA 2021
Country/Territory: United States
City: Virtual, Online
Period: 12/13/21 - 12/16/21

Funding

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This material is based upon work supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under contract number DE-AC05-00OR22725, and in part by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC. Notice: This manuscript has been authored in part by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Funders (funder number where given):
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research - DE-AC05-00OR22725
Oak Ridge National Laboratory
UT-Battelle

Keywords

• Cloud
• Distributed Training
• Elastic training
• Momentum
• SGD
