Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks

Joshua Romero, Junqi Yin, Nouamane Laanait, Bing Xie, M. Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alex Sergeev, Michael Matheson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Scopus citations

Abstract

This work develops new techniques within Horovod, a generic communication library supporting data parallel training across deep learning frameworks. In particular, we improve the Horovod control plane by implementing a new coordination scheme that takes advantage of the characteristics of the typical data parallel training paradigm, namely the repeated execution of collectives on the gradients of a fixed set of tensors. Using a caching strategy, we execute Horovod's existing coordinator-worker logic only once during a typical training run, replacing it for the remaining training duration with a more efficient decentralized orchestration strategy that uses the cached data and a global intersection of bitvectors. Next, we introduce a feature that lets end users explicitly group collective operations, enabling finer-grained control over communication buffer sizes. To evaluate the proposed strategies, we conduct experiments on a world-class supercomputer, Summit. Compared to Horovod's original design, we observe a 2× performance improvement at a scale of 6,000 GPUs; compared to tf.distribute and torch.DDP, we achieve 12% better and comparable performance, respectively, using up to 1,536 GPUs; compared to BytePS in typical HPC settings, we achieve about 20% better performance at a scale of 768 GPUs. Finally, we test our strategies on a scientific application (STEMDL) using up to 27,600 GPUs (the entire Summit) and show a near-linear scaling efficiency of 0.93 with a sustained performance of 1.54 exaflops (standard error ±0.02) in FP16 precision.
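As a rough illustration of the decentralized coordination described above, the sketch below (in Python with mpi4py, purely for exposition; Horovod's actual implementation lives in its C++ core, and the function and variable names here are hypothetical) shows how a cached tensor-to-bit mapping plus a single bitwise-AND allreduce can replace per-step coordinator-worker negotiation.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def build_cache(tensor_names):
    # Bit positions are assigned once from the fixed set of gradient tensors,
    # in the same deterministic order on every worker (the real cache
    # establishes this agreement during the initial coordinator-driven steps).
    return {name: i for i, name in enumerate(sorted(tensor_names))}

def globally_ready(cache, locally_ready):
    """Return the tensors that every worker has marked ready this step."""
    nbits = len(cache)
    nwords = (nbits + 63) // 64
    bits = np.zeros(nwords, dtype=np.uint64)
    # Local bitvector: bit i is set if tensor i is ready on this worker.
    for name in locally_ready:
        i = cache[name]
        bits[i // 64] |= np.uint64(1 << (i % 64))
    # A single bitwise-AND allreduce computes the global intersection,
    # replacing the per-step coordinator/worker message exchange.
    comm.Allreduce(MPI.IN_PLACE, bits, op=MPI.BAND)
    return [name for name, i in cache.items()
            if bits[i // 64] & np.uint64(1 << (i % 64))]

# Example: issue collectives only for tensors that are ready on all workers.
cache = build_cache(["conv1.weight", "conv1.bias", "fc.weight"])
ready = globally_ready(cache, ["conv1.weight", "fc.weight"])
```

The explicit grouping feature might be used from a training script roughly as follows; the `grouped_allreduce` call and the `num_groups` optimizer argument are taken from recent Horovod releases and are shown here as an assumption about the user-facing API rather than a definitive reference.

```python
import torch
import horovod.torch as hvd

hvd.init()

# Option 1: fuse a chosen set of tensors into a single communication buffer.
grads = [torch.randn(1024) for _ in range(8)]
averaged = hvd.grouped_allreduce(grads)

# Option 2: let the optimizer wrapper split a model's gradients into a fixed
# number of groups, giving coarse control over buffer sizes.
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
opt = hvd.DistributedOptimizer(opt,
                               named_parameters=model.named_parameters(),
                               num_groups=2)
```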

Original language: English
Title of host publication: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
Publisher: USENIX Association
Pages: 1027-1040
Number of pages: 14
ISBN (Electronic): 9781939133274
State: Published - 2022
Event: 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 - Renton, United States
Duration: Apr 4, 2022 - Apr 6, 2022

Publication series

Name: Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022

Conference

Conference: 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
Country/Territory: United States
City: Renton
Period: 04/4/22 - 04/6/22

Funding

We would like to thank the anonymous reviewers and our shepherd, Shivaram Venkataraman, for their invaluable comments that improved this paper. This research was partially funded by a Lab Directed Research and Development project at Oak Ridge National Laboratory, a U.S. Department of Energy facility managed by UT-Battelle, LLC. An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
