TY - GEN
T1 - A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers
AU - Han, Jingoo
AU - Xu, Luna
AU - Rafique, M. Mustafa
AU - Butt, Ali R.
AU - Lim, Seung Hwan
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and high-speed interconnect networks are needed to support DL on HPC systems. However, the excessive use of GPUs without considering their effective benefits leads to inefficient resource utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and identify the viability of next-generation DL-optimized heterogeneous supercomputers for enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster shows varying performance trends compared to the existing literature for single- and multi-node training. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.
AB - Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to deal with massive quantities of multi-dimensional data. High-performance computing (HPC) supercomputers are increasingly being employed to meet the exponentially growing demand for DL. Multiple GPUs and high-speed interconnect networks are needed to support DL on HPC systems. However, the excessive use of GPUs without considering their effective benefits leads to inefficient resource utilization of these expensive setups. In this paper, we conduct a quantitative analysis to gauge the efficacy of DL workloads on the latest HPC system and identify the viability of next-generation DL-optimized heterogeneous supercomputers for enabling researchers to develop more efficient resource management and distributed DL middleware. We evaluate well-known DL models with large-scale datasets using the popular TensorFlow framework, and provide a thorough evaluation including scalability, accuracy, variability, storage resources, GPU-GPU/GPU-CPU data transfer, and GPU utilization. Our analysis reveals that the latest heterogeneous supercomputing cluster shows varying performance trends compared to the existing literature for single- and multi-node training. To the best of our knowledge, this is the first work to conduct such a quantitative and comprehensive study of DL training on a supercomputing system with multiple GPUs.
KW - Deep learning
KW - GPU Cluster
KW - Heterogeneous Supercomputers
KW - High Performance Computing
KW - TensorFlow
UR - http://www.scopus.com/inward/record.url?scp=85075266793&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2019.8890993
DO - 10.1109/CLUSTER.2019.8890993
M3 - Conference contribution
AN - SCOPUS:85075266793
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
BT - Proceedings - 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Y2 - 23 September 2019 through 26 September 2019
ER -