TY - GEN
T1 - Towards native execution of deep learning on a leadership-class HPC system
AU - Yoginath, Srikanth
AU - Alam, Maksudul
AU - Ramanathan, Arvind
AU - Bhowmik, Debsindhu
AU - Laanait, Nouamane
AU - Perumalla, Kalyan S.
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Large parallel machines generally offer the best parallel performance with 'native execution', achieved using codes developed with the optimized compilers, communication libraries, and runtimes offered on those machines. In this paper, we report and analyze performance results from native execution of deep learning on a leadership-class high-performance computing (HPC) system. Using our new code called DeepEx, we present a study of the parallel speedup and convergence rates of learning achieved with native parallel execution. In the trade-off between computational parallelism and synchronized convergence, we first focus on maximizing parallelism while still obtaining convergence. Scaling results are reported from execution on up to 15,000 GPUs using two scientific data sets from atom microscopy and protein folding applications, and also using the popular ImageNet data set. In terms of the traditional measure of parallel speedup, excellent scaling is observed up to 12,000 GPUs. Additionally, accounting for convergence rates of deep learning accuracy or error, a deep learning-specific metric called 'learning speedup' is also tracked. The performance results indicate the need to evaluate parallel deep learning execution in terms of learning speedup, and point to additional directions for improved exploitation of high-end HPC systems.
KW - Deep Learning
KW - Learning Speedup
KW - Massively Parallel Systems
KW - Parallel Speedup
UR - http://www.scopus.com/inward/record.url?scp=85070383771&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2019.00160
DO - 10.1109/IPDPSW.2019.00160
M3 - Conference contribution
AN - SCOPUS:85070383771
T3 - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
SP - 941
EP - 950
BT - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2019
Y2 - 20 May 2019 through 24 May 2019
ER -