Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch

Elvis Rojas, Fabricio Quirós-Corella, Terry Jones, Esteban Meneses

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Artificial intelligence is a transformative technology for creating new scientific discoveries, services, and products. Its full potential is achieved when massive data repositories and large-scale computing systems are available. Both factors are becoming easier to obtain daily, as sensor networks constantly create open-data archives and Moore's law still makes supercomputing power more accessible. However, as deep learning models grow larger to tackle data complexity, researchers must determine how to speed up their training. This paper takes an experimental approach to understanding the algorithms and trade-offs associated with distributed deep learning. Using the Summit supercomputer at Oak Ridge National Laboratory, this study shows that existing distributed deep learning mechanisms scale well in execution time; however, as more nodes are used, accuracy degrades significantly unless several hyper-parameters are retuned, and the results show that optimizing those parameters is a nontrivial task. We also evaluated the impact of other scaling techniques, such as mixed precision and adaptive parameter optimization.
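
For context, the mechanisms named in the abstract correspond to standard PyTorch APIs. Below is a minimal sketch, not the authors' experimental code, of the two techniques combined: data-parallel training with torch.nn.parallel.DistributedDataParallel and mixed precision via torch.cuda.amp. The model, batch size, and learning rate are placeholders; the adaptive parameter optimizers the paper evaluates would replace the plain SGD used here.

    # Minimal sketch of PyTorch distributed data-parallel training with
    # mixed precision. Launch with torchrun, one process per GPU; the
    # model and hyper-parameters below are placeholders, not the paper's.
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(local_rank: int) -> None:
        # torchrun (or a Slurm wrapper) sets RANK, WORLD_SIZE, MASTER_ADDR.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

        model = nn.Linear(1024, 10).cuda(local_rank)   # placeholder model
        model = DDP(model, device_ids=[local_rank])    # gradients all-reduced across ranks

        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        scaler = torch.cuda.amp.GradScaler()           # loss scaling for mixed precision
        loss_fn = nn.CrossEntropyLoss()

        for _ in range(10):                            # placeholder training loop
            x = torch.randn(32, 1024, device=local_rank)
            y = torch.randint(0, 10, (32,), device=local_rank)

            optimizer.zero_grad()
            with torch.cuda.amp.autocast():            # forward pass in float16 where safe
                loss = loss_fn(model(x), y)
            scaler.scale(loss).backward()              # scaled backward to avoid underflow
            scaler.step(optimizer)
            scaler.update()

        dist.destroy_process_group()

    if __name__ == "__main__":
        train(int(os.environ.get("LOCAL_RANK", 0)))

As the study observes, scaling this pattern to more nodes grows the effective batch size, which is what forces the retuning of learning rate and related hyper-parameters.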

Original language: English
Title of host publication: High Performance Computing - 8th Latin American Conference, CARLA 2021, Revised Selected Papers
Editors: Isidoro Gitler, Carlos Jaime Barrios Hernández, Esteban Meneses
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 177-192
Number of pages: 16
ISBN (Print): 9783031042089
DOIs
State: Published - 2022
Event: 8th Latin American High Performance Computing Conference, CARLA 2021 - Virtual, Online
Duration: Oct 6, 2021 - Oct 8, 2021

Publication series

Name: Communications in Computer and Information Science
Volume: 1540 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 8th Latin American High Performance Computing Conference, CARLA 2021
City: Virtual, Online
Period: 10/6/21 - 10/8/21

Funding

Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains, and the publisher, by accepting the article for publication, acknowledges that the US government retains, a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).

Funders: U.S. Department of Energy

Keywords

• Distributed deep learning
• Performance
• Scalability
