TY - GEN
T1 - Strategies to deploy and scale deep learning on the Summit supercomputer
AU - Yin, Junqi
AU - Gahlot, Shubhankar
AU - Laanait, Nouamane
AU - Maheshwari, Ketan
AU - Morrison, Jack
AU - Dash, Sajal
AU - Shankar, Mallikarjun
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - The rapid growth and wide applicability of Deep Learning (DL) frameworks pose challenges to computing centers, which need to deploy and support the software, and to domain scientists, who must keep up with the system environment and scale up scientific exploration through DL. We offer recommendations for deploying and scaling DL frameworks on the Summit supercomputer, currently atop the Top500 list, at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory. We discuss DL software deployment in the form of containers and compare the performance of natively built frameworks with containerized deployment. Software containers show no noticeable negative performance impact, exhibit faster Python loading times, and promise easier maintenance. To explore strategies for scaling up DL model training campaigns, we assess DL compute kernel performance, discuss and recommend I/O data formats and staging, and identify communication needs for scalable message exchange in DL runs at scale. As best practice, we recommend that users take a step-wise tuning approach, beginning with algorithmic kernel choice, followed by node I/O configuration and communications tuning. We present a baseline example achieving 87% scaling efficiency for a ResNet50 DL run on 1024 nodes (6144 V100 GPUs).
AB - The rapid growth and wide applicability of Deep Learning (DL) frameworks pose challenges to computing centers, which need to deploy and support the software, and to domain scientists, who must keep up with the system environment and scale up scientific exploration through DL. We offer recommendations for deploying and scaling DL frameworks on the Summit supercomputer, currently atop the Top500 list, at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory. We discuss DL software deployment in the form of containers and compare the performance of natively built frameworks with containerized deployment. Software containers show no noticeable negative performance impact, exhibit faster Python loading times, and promise easier maintenance. To explore strategies for scaling up DL model training campaigns, we assess DL compute kernel performance, discuss and recommend I/O data formats and staging, and identify communication needs for scalable message exchange in DL runs at scale. As best practice, we recommend that users take a step-wise tuning approach, beginning with algorithmic kernel choice, followed by node I/O configuration and communications tuning. We present a baseline example achieving 87% scaling efficiency for a ResNet50 DL run on 1024 nodes (6144 V100 GPUs).
KW - HPC
KW - Performance evaluation
KW - Scalable machine learning
KW - Software deployment
UR - http://www.scopus.com/inward/record.url?scp=85078093675&partnerID=8YFLogxK
U2 - 10.1109/DLS49591.2019.00016
DO - 10.1109/DLS49591.2019.00016
M3 - Conference contribution
AN - SCOPUS:85078093675
T3 - Proceedings of DLS 2019: Deep Learning on Supercomputers - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 84
EP - 94
BT - Proceedings of DLS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE/ACM Workshop on Deep Learning on Supercomputers, DLS 2019
Y2 - 17 November 2019
ER -