TY - GEN
T1 - Integrating deep learning in domain sciences at Exascale
AU - Archibald, Rick
AU - Chow, Edmond
AU - D’Azevedo, Eduardo
AU - Dongarra, Jack
AU - Eisenbach, Markus
AU - Febbo, Rocco
AU - Lopez, Florent
AU - Nichols, Daniel
AU - Tomov, Stanimire
AU - Wong, Kwai
AU - Yin, Junqi
N1 - Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2021
Y1 - 2021
N2 - This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, cuBLAS, cuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
UR - http://www.scopus.com/inward/record.url?scp=85107270982&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-63393-6_3
DO - 10.1007/978-3-030-63393-6_3
M3 - Conference contribution
AN - SCOPUS:85107270982
SN - 9783030633929
T3 - Communications in Computer and Information Science
SP - 35
EP - 50
BT - Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020, Revised Selected Papers
A2 - Nichols, Jeffrey
A2 - Maccabe, Arthur ‘Barney’
A2 - Parete-Koon, Suzanne
A2 - Verastegui, Becky
A2 - Hernandez, Oscar
A2 - Ahearn, Theresa
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020
Y2 - 26 August 2020 through 28 August 2020
ER -