A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers

Jingoo Han, Luna Xu, M. Mustafa Rafique, Ali R. Butt, Seung Hwan Lim

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

12 Scopus citations

Abstract

Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it must deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and a high-speed interconnect network are needed to support DL on HPC systems. However, using many GPUs without weighing their effective benefit leads to inefficient utilization of these expensive resources. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and assess the viability of next-generation DL-optimized heterogeneous supercomputers, enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation covering scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster shows performance trends that differ from those reported in the existing literature for single- and multi-node training. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.
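For context, below is a minimal sketch of the kind of data-parallel, multi-GPU TensorFlow training the study evaluates, using tf.distribute.MirroredStrategy. It is illustrative only: the model (ResNet50), dataset (CIFAR-10), and hyperparameters are placeholders, not the paper's actual configuration.

    # Illustrative multi-GPU data-parallel training sketch; model, dataset,
    # and hyperparameters are placeholders, not the paper's configuration.
    import tensorflow as tf

    # Replicate the model across all local GPUs; gradients are all-reduced
    # (e.g., via NCCL over NVLink) after each step.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    # Scale the global batch size with the number of GPU replicas.
    per_replica_batch = 64
    global_batch = per_replica_batch * strategy.num_replicas_in_sync

    # Small stand-in dataset; the study itself uses large-scale datasets.
    (x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
    dataset = (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
        .shuffle(10_000)
        .batch(global_batch)
        .prefetch(tf.data.AUTOTUNE)
    )

    with strategy.scope():
        # Variables created inside the scope are mirrored on every GPU.
        model = tf.keras.applications.ResNet50(
            weights=None, input_shape=(32, 32, 3), classes=10
        )
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )

    model.fit(dataset, epochs=2)

The GPU utilization measurements mentioned in the abstract can be approximated by periodically polling nvidia-smi while training runs. The --query-gpu fields below are real nvidia-smi fields; the sampling loop and output file name are assumptions for illustration.

    # Hypothetical GPU-utilization sampler: polls nvidia-smi at a fixed
    # interval and appends one CSV row per GPU per sample.
    import csv, subprocess, time

    QUERY = "timestamp,index,utilization.gpu,utilization.memory,memory.used"

    def sample_gpu_utilization(interval_s=1.0, samples=60,
                               out_path="gpu_util.csv"):
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(QUERY.split(","))
            for _ in range(samples):
                out = subprocess.run(
                    ["nvidia-smi", f"--query-gpu={QUERY}",
                     "--format=csv,noheader,nounits"],
                    capture_output=True, text=True, check=True,
                ).stdout
                for line in out.strip().splitlines():
                    writer.writerow(s.strip() for s in line.split(","))
                time.sleep(interval_s)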

Original language: English
Title of host publication: Proceedings - 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728147345
State: Published - Sep 2019
Event: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019 - Albuquerque, United States
Duration: Sep 23, 2019 - Sep 26, 2019

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2019-September
ISSN (Print): 1552-5244

Conference

Conference: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Country/Territory: United States
City: Albuquerque
Period: 09/23/19 - 09/26/19

Funding

This work is sponsored in part by the NSF under grants CNS-1405697, CNS-1615411, and CNS-1565314/1838271. This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the DOE under Contract DE-AC05-00OR22725.

Funders and funder numbers:

• National Science Foundation: CNS-1405697, CNS-1615411, CNS-1565314/1838271
• U.S. Department of Energy, Office of Science: DE-AC05-00OR22725
• Oak Ridge National Laboratory, National Center for Computational Sciences

Keywords

• Deep learning
• GPU Cluster
• Heterogeneous Supercomputers
• High Performance Computing
• TensorFlow
