Abstract
This work develops new techniques within Horovod, a generic communication library supporting data-parallel training across deep learning frameworks. In particular, we improve the Horovod control plane by implementing a new coordination scheme that exploits the characteristics of the typical data-parallel training paradigm, namely the repeated execution of collectives on the gradients of a fixed set of tensors. Using a caching strategy, we execute Horovod's existing coordinator-worker logic only once during a typical training run, replacing it for the remaining training duration with a more efficient decentralized orchestration strategy that uses the cached data and a global intersection of bitvectors. Next, we introduce a feature that lets end users explicitly group collective operations, enabling finer-grained control over communication buffer sizes. To evaluate the proposed strategies, we conduct experiments on a world-class supercomputer, Summit. We compare our proposals to Horovod's original design and observe a 2× performance improvement at a scale of 6000 GPUs; against tf.distribute and torch.DDP, we achieve 12% better and comparable performance, respectively, using up to 1536 GPUs; against BytePS in typical HPC settings, we achieve about 20% better performance at a scale of 768 GPUs. Finally, we test our strategies on a scientific application (STEMDL) using up to 27,600 GPUs (the entirety of Summit) and show that we achieve a near-linear scaling efficiency of 0.93 with a sustained performance of 1.54 exaflops (standard error ±0.02) in FP16 precision.
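The decentralized orchestration described above can be illustrated with a small, self-contained sketch. This is not Horovod's actual code; it only simulates the idea: after a one-time run of the coordinator-worker logic assigns every tensor name a stable bit position in a cache shared by all workers, each worker per step builds a bitvector of its locally ready gradients, and a global bitwise-AND (an allreduce in a real deployment, simulated here over a list of workers) yields exactly the tensors that are ready everywhere and may be reduced. The tensor names and cache layout are hypothetical.

```python
# Illustrative sketch (not Horovod's implementation) of cached,
# decentralized coordination via a global intersection of bitvectors.
from functools import reduce

# Assumed shared cache: tensor name -> bit position, identical on all
# workers after a single run of the original coordinator-worker logic.
CACHE = {"conv1.grad": 0, "conv2.grad": 1, "fc.grad": 2}

def local_bitvector(ready_tensors):
    """Encode the locally ready tensor set as an integer bitmask."""
    bits = 0
    for name in ready_tensors:
        bits |= 1 << CACHE[name]
    return bits

def globally_ready(per_worker_ready):
    """Bitwise-AND all workers' bitvectors and decode the result."""
    intersection = reduce(lambda a, b: a & b,
                          (local_bitvector(r) for r in per_worker_ready))
    return sorted(name for name, bit in CACHE.items()
                  if intersection & (1 << bit))

# Three simulated workers; only tensors ready on every worker survive
# the intersection, so no per-tensor metadata exchange is needed.
workers = [
    {"conv1.grad", "conv2.grad", "fc.grad"},  # worker 0
    {"conv1.grad", "fc.grad"},                # worker 1 (conv2 pending)
    {"conv1.grad", "conv2.grad", "fc.grad"},  # worker 2
]
print(globally_ready(workers))  # -> ['conv1.grad', 'fc.grad']
```

In a real run, the integer AND would be a single small allreduce (e.g. with a band reduction op), which is far cheaper than gathering variable-length tensor-name lists at a coordinator every step.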
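The explicit-grouping feature can likewise be sketched. The greedy packer below is illustrative only, not Horovod's API (Horovod exposes the idea through options such as a `num_groups` argument on its distributed optimizer): the user-controlled cap bounds the size of each fused communication buffer instead of leaving fusion to whatever tensors happen to be ready. The gradient names and sizes are hypothetical.

```python
# Hedged sketch of user-controlled grouping of collective operations:
# partition the gradient list up front so each fused buffer stays
# under a user-chosen byte cap.
def group_tensors(tensor_sizes, max_group_bytes):
    """Greedily pack (name, size) pairs into groups whose total size
    stays within max_group_bytes; an oversized tensor gets its own group."""
    groups, current, current_bytes = [], [], 0
    for name, size in tensor_sizes:
        if current and current_bytes + size > max_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

grads = [("conv1.grad", 4 << 20), ("conv2.grad", 12 << 20),
         ("fc1.grad", 48 << 20), ("fc2.grad", 2 << 20)]
# Cap each fused allreduce buffer at 16 MiB: conv1+conv2 fit together,
# fc1 exceeds the cap alone, fc2 starts a fresh group.
print(group_tensors(grads, 16 << 20))
# -> [['conv1.grad', 'conv2.grad'], ['fc1.grad'], ['fc2.grad']]
```

Each resulting group would then be handed to one grouped collective, giving the user direct control over the buffer-size/latency trade-off rather than relying on timing-dependent dynamic fusion.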
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 |
| Publisher | USENIX Association |
| Pages | 1027–1040 |
| Number of pages | 14 |
| ISBN (Electronic) | 9781939133274 |
| State | Published - 2022 |
| Event | 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, United States (Apr 4 2022 → Apr 6 2022) |
Publication series

| Name | Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 |
| --- | --- |
Conference

| Conference | 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 |
| --- | --- |
| Country/Territory | United States |
| City | Renton |
| Period | 04/4/22 → 04/6/22 |
Funding
We would like to thank the anonymous reviewers and our shepherd, Shivaram Venkataraman, for their invaluable comments that improved this paper. This research was partially funded by a Lab Directed Research and Development project at Oak Ridge National Laboratory, a U.S. Department of Energy facility managed by UT-Battelle, LLC. An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.