Abstract
Artificial intelligence is a transformative technology for creating new scientific discoveries, services, and products. Its full potential is achieved when massive data repositories and large-scale computing systems are available, and both factors are becoming more accessible every day: sensor networks constantly feed open-data archives, and Moore’s law continues to make supercomputing power more affordable. However, as deep learning models grow larger to tackle data complexity, researchers must determine how to speed up their training. This paper takes an experimental approach to understanding the algorithms and trade-offs associated with distributed deep learning. Using the Summit supercomputer at Oak Ridge National Laboratory, we found that existing distributed deep learning mechanisms scale well in execution time; however, as more nodes are used, accuracy degrades significantly unless several hyper-parameters are retuned. The results show that optimizing those parameters is a nontrivial task. We also evaluated the impact of other scaling techniques, such as mixed precision and adaptive parameter optimization.
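For illustration only, the sketch below shows a generic data-parallel training loop of the kind the abstract discusses, combining linear learning-rate scaling (one of the hyper-parameters that typically must be retuned at scale) with mixed-precision training. It assumes PyTorch with DistributedDataParallel and a toy model and synthetic data; the paper's actual framework, model, dataset, and hyper-parameter choices are not specified here.

```python
# Minimal data-parallel training sketch (assumption: PyTorch + NCCL backend;
# not the paper's actual implementation).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=6 train.py`.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data stand in for the real workload.
    model = torch.nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Linear learning-rate scaling: as more workers join, the effective batch
    # grows, so the base rate is multiplied by the world size. Adaptive
    # schemes (e.g. layer-wise optimizers) are an alternative at large scale.
    base_lr = 0.01
    lr = base_lr * dist.get_world_size()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    # Mixed precision: forward/backward run in float16 where safe,
    # with dynamic loss scaling to avoid underflow.
    scaler = torch.cuda.amp.GradScaler()
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()   # DDP averages gradients across ranks here
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that the linear-scaling rule is only a starting heuristic; as the abstract points out, recovering single-node accuracy at larger node counts generally requires further tuning.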
Original language | English |
---|---|
Title of host publication | High Performance Computing - 8th Latin American Conference, CARLA 2021, Revised Selected Papers |
Editors | Isidoro Gitler, Carlos Jaime Barrios Hernández, Esteban Meneses |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 177-192 |
Number of pages | 16 |
ISBN (Print) | 9783031042089 |
DOIs | |
State | Published - 2022 |
Event | 8th Latin American High Performance Computing Conference, CARLA 2021 - Virtual, Online |
Duration | Oct 6 2021 → Oct 8 2021 |
Publication series
Name | Communications in Computer and Information Science |
---|---|
Volume | 1540 CCIS |
ISSN (Print) | 1865-0929 |
ISSN (Electronic) | 1865-0937 |
Conference
Conference | 8th Latin American High Performance Computing Conference, CARLA 2021 |
---|---|
City | Virtual, Online |
Period | 10/6/21 → 10/8/21 |
Funding
Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).
Keywords
- Distributed deep learning
- Performance
- Scalability