Skip to main navigation Skip to search Skip to main content

Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

27 Scopus citations

Abstract

This work develops new techniques within Horovod, a generic communication library supporting data parallel training across deep learning frameworks. In particular, we improve the Horovod control plane by implementing a new coordination scheme that takes advantage of the characteristics of the typical data parallel training paradigm, namely the repeated execution of collectives on the gradients of a fixed set of tensors. Using a caching strategy, we execute Horovod's existing coordinator-worker logic only once during a typical training run, replacing it with a more efficient decentralized orchestration strategy using the cached data and a global intersection of a bitvector for the remaining training duration. Next, we introduce a feature for end users to explicitly group collective operations, enabling finer grained control over the communication buffer sizes. To evaluate our proposed strategies, we conduct experiments on a world-class supercomputer - Summit. We compare our proposals to Horovod's original design and observe 2× performance improvement at a scale of 6000 GPUs; we also compare them against tf.distribute and torch.DDP and achieve 12% better and comparable performance, respectively, using up to 1536 GPUs; we compare our solution against BytePS in typical HPC settings and achieve about 20% better performance on a scale of 768 GPUs. Finally, we test our strategies on a scientific application (STEMDL) using up to 27,600 GPUs (the entire Summit) and show that we achieve a near-linear scaling of 0.93 with a sustained performance of 1.54 exaflops (with standard error +- 0.02) in FP16 precision.

Original languageEnglish
Title of host publicationProceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
PublisherUSENIX Association
Pages1027-1040
Number of pages14
ISBN (Electronic)9781939133274
StatePublished - 2022
Event19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 - Renton, United States
Duration: Apr 4 2022Apr 6 2022

Publication series

NameProceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022

Conference

Conference19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
Country/TerritoryUnited States
CityRenton
Period04/4/2204/6/22

Funding

We would like to thank the anonymous reviewers and our shepherd, Shivaram Venkataraman, for their invaluable comments that improved this paper. This research was partially funded by a Lab Directed Research and Development project at Oak Ridge National Laboratory, a U.S. Department of Energy facility managed by UT-Battelle, LLC. An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Fingerprint

Dive into the research topics of 'Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks'. Together they form a unique fingerprint.

Cite this