Abstract
Deep learning (DL) has become a key tool for solving complex scientific problems. However, managing the multi-dimensional large-scale data associated with DL, especially atop extant multiple graphics processing units (GPUs) in modern supercomputers poses significant challenges. Moreover, the latest high-performance computing (HPC) architectures bring different performance trends in training throughput compared to the existing studies. Existing DL optimizations such as larger batch size and GPU locality-aware scheduling have little effect on improving DL training throughput performance due to fast CPU-to-GPU connections. Additionally, DL training on multiple GPUs scales sublinearly. Thus, simply adding more GPUs to a system is ineffective. To this end, we design MARBLE, a first-of-its-kind job scheduler, which considers the non-linear scalability of GPUs at the intra-node level to schedule an appropriate number of GPUs per node for a job. By sharing the GPU resources on a node with multiple DL jobs, MARBLE avoids low GPU utilization in current multi-GPU DL training on HPC systems. Our comprehensive evaluation in the Summit supercomputer shows that MARBLE is able to improve DL training performance by up to 48.3% compared to the popular Platform Load Sharing Facility (LSF) scheduler. Compared to the state-of-the-art of DL scheduler, Optimus, MARBLE reduces the job completion time by up to 47%.
Original language | English |
---|---|
Title of host publication | Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020 |
Editors | Laurent Lefevre, Carlos A. Varela, George Pallis, Adel N. Toosi, Omer Rana, Rajkumar Buyya |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 272-281 |
Number of pages | 10 |
ISBN (Electronic) | 9781728160955 |
DOIs | |
State | Published - May 2020 |
Event | 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020 - Melbourne, Australia Duration: May 11 2020 → May 14 2020 |
Publication series
Name | Proceedings - 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020 |
---|
Conference
Conference | 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020 |
---|---|
Country/Territory | Australia |
City | Melbourne |
Period | 05/11/20 → 05/14/20 |
Funding
ACKNOWLEDGMENT This work is sponsored in part by the NSF under the grants: CCF-1919113, CNS-1405697, CNS-1615411, and CNS-1565314/1838271. This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the DOE under Contract DE-AC05-00OR22725.