Abstract
This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be met (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, cuBLAS, cuDNN, MKL, and HIP. Advancements are also illustrated through algorithmic enhancements in reduced and mixed precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
Original language | English |
---|---|
Title of host publication | Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI - 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020, Revised Selected Papers |
Editors | Jeffrey Nichols, Arthur ‘Barney’ Maccabe, Suzanne Parete-Koon, Becky Verastegui, Oscar Hernandez, Theresa Ahearn |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 35-50 |
Number of pages | 16 |
ISBN (Print) | 9783030633929 |
State | Published - 2021 |
Event | 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020 - Virtual, Online. Duration: Aug 26 2020 → Aug 28 2020 |
Publication series
Name | Communications in Computer and Information Science |
---|---|
Volume | 1315 CCIS |
ISSN (Print) | 1865-0929 |
ISSN (Electronic) | 1865-0937 |
Conference
Conference | 17th Smoky Mountains Computational Sciences and Engineering Conference, SMC 2020 |
---|---|
City | Virtual, Online |
Period | 08/26/20 → 08/28/20 |
Funding
This work was conducted at the Joint Institute for Computational Sciences (JICS) and the Innovative Computing Laboratory (ICL), sponsored by the National Science Foundation (NSF), through NSF REU Award #1659502 and NSF Award #1709069. This work used hardware donations from NVIDIA as well as the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by NSF grant number ACI-1548562. Computational resources are available through an XSEDE education allocation award, TG-ASC170031.