Strategies to deploy and scale deep learning on the Summit supercomputer

Junqi Yin, Shubhankar Gahlot, Nouamane Laanait, Ketan Maheshwari, Jack Morrison, Sajal Dash, Mallikarjun Shankar

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

19 Scopus citations

Abstract

The rapid growth and wide applicability of Deep Learning (DL) frameworks pose challenges to computing centers that need to deploy and support the software, and also to domain scientists who have to keep up with the system environment and scale up scientific exploration through DL. We offer recommendations for deploying and scaling DL frameworks on the Summit supercomputer, currently atop the Top500 list, at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF). We discuss DL software deployment in the form of containers, and compare the performance of native-built frameworks and containerized deployment. Software containers show no noticeable negative performance impact, exhibit faster Python loading times, and promise easier maintenance. To explore strategies for scaling up DL model training campaigns, we assess DL compute kernel performance, discuss and recommend I/O data formats and staging, and identify communication needs for scalable message exchange for DL runs at scale. We recommend, as a best practice, that users take a step-wise tuning approach, beginning with algorithmic kernel choice, followed by node I/O configuration and communications tuning. We present baseline examples, with a scaling efficiency of 87% for a DL run of ResNet50 on 1024 nodes (6144 V100 GPUs).
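As a rough illustration of what the 87% figure means, the sketch below computes scaling efficiency as measured throughput divided by ideal linear-scaled throughput. This is a common definition and is assumed here; the abstract does not state the exact formula used in the paper. The node and GPU counts come from the abstract, while the function name and the throughput values are hypothetical placeholders, not measurements.

```python
def scaling_efficiency(throughput_n, throughput_base, workers_n, workers_base=1):
    """Measured throughput divided by ideal (linear) throughput.

    Assumed definition: ideal throughput grows in proportion to the
    number of workers relative to the baseline run.
    """
    ideal = throughput_base * (workers_n / workers_base)
    return throughput_n / ideal


# Counts taken from the abstract.
NODES = 1024
GPUS = 6144
gpus_per_node = GPUS // NODES  # 6 GPUs on each node

# Hypothetical throughputs (images/s), chosen only to illustrate the
# arithmetic; they are NOT results reported by the authors.
single_gpu_throughput = 1000.0
aggregate_throughput = 5_345_280.0

efficiency = scaling_efficiency(aggregate_throughput, single_gpu_throughput, GPUS)
print(f"{gpus_per_node} GPUs per node, efficiency = {efficiency:.2f}")
```

Under these placeholder numbers, perfect linear scaling would give 6,144,000 images/s, so the assumed aggregate rate corresponds to the 87% efficiency quoted above.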

Original language: English
Title of host publication: Proceedings of DLS 2019
Subtitle of host publication: Deep Learning on Supercomputers - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 84-94
Number of pages: 11
ISBN (Electronic): 9781728160115
DOIs
State: Published - Nov 2019
Event: 3rd IEEE/ACM Workshop on Deep Learning on Supercomputers, DLS 2019 - Denver, United States
Duration: Nov 17 2019 → …

Publication series

Name: Proceedings of DLS 2019: Deep Learning on Supercomputers - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 3rd IEEE/ACM Workshop on Deep Learning on Supercomputers, DLS 2019
Country/Territory: United States
City: Denver
Period: 11/17/19 → …

Funding

This research was sponsored by and used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility, and the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, supported by the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Part of the computer time was also provided by the INCITE program.

Funders and funder numbers:
CADES
Data Environment for Science
U.S. Department of Energy: DE-AC05-00OR22725
Office of Science

    Keywords

    • HPC
    • Performance evaluation
    • Scalable machine learning
    • Software deployment
