TY - GEN
T1 - Large-Scale Distributed Deep Learning
T2 - 8th Latin American High Performance Computing Conference, CARLA 2021
AU - Rojas, Elvis
AU - Quirós-Corella, Fabricio
AU - Jones, Terry
AU - Meneses, Esteban
N1 - Publisher Copyright:
© 2022, UT-Battelle, LLC.
PY - 2022
Y1 - 2022
AB - Artificial intelligence is a transforming technology for creating new scientific discoveries, services, and products. Its full potential is achieved when massive data repositories and large-scale computing systems are available. Both factors are becoming easier to obtain daily as sensor networks constantly create open-data archives, and Moore’s law still makes supercomputing power more accessible. However, as deep learning models become larger to tackle data complexity, researchers must determine how to speed up training in those models. This paper uses an experimental approach to try to understand the algorithms and trade-offs associated with distributed deep learning. This study used the Summit supercomputer at Oak Ridge National Laboratory to determine that existing distributed deep learning mechanisms scale in execution time. However, as more nodes are used, accuracy degrades significantly. To solve this, several hyper-parameters must be tuned. The results show that optimizing those parameters is a nontrivial task. We also evaluated the impact of other scaling techniques, such as mixed precision and adaptive parameter optimization.
KW - Distributed deep learning
KW - Performance
KW - Scalability
UR - http://www.scopus.com/inward/record.url?scp=85128986740&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-04209-6_13
DO - 10.1007/978-3-031-04209-6_13
M3 - Conference contribution
AN - SCOPUS:85128986740
SN - 9783031042089
T3 - Communications in Computer and Information Science
SP - 177
EP - 192
BT - High Performance Computing - 8th Latin American Conference, CARLA 2021, Revised Selected Papers
A2 - Gitler, Isidoro
A2 - Barrios Hernández, Carlos Jaime
A2 - Meneses, Esteban
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 6 October 2021 through 8 October 2021
ER -