Abstract
Oak Ridge National Laboratory (ORNL) installed the Summit supercomputer in 2018. Summit is an accelerated-node architecture with 4,608 nodes, each with two IBM POWER9 (P9) CPUs and six NVIDIA Volta V100 GPUs, a large DRAM footprint, high-bandwidth memory (HBM) supporting the GPUs, nonvolatile memory, and fast NVLink and InfiniBand interconnects. The machine was designed to deliver over 200 peak double-precision petaflops for scientific modeling and simulation applications and over 3 peak reduced-precision ExaOps. How these features affect application performance depends on whether a code is simulation-oriented, write-intensive, data-analysis-oriented, read-intensive, or communication-intensive. In the context of artificial intelligence (AI) and machine learning (ML), these features support data-intensive applications that infer and predict statistical relationships in complex datasets. This article presents recent experiences at ORNL using Summit for applications in AI and ML and describes example code and algorithmic changes necessary to use Summit effectively. Finally, this article discusses research directions in scalable ML, including algorithms research and the combination of data analysis with modeling and simulation in an accelerated-node, exascale environment.
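The peak figures quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate: the node and GPU counts come from the abstract, while the per-GPU peak rates (7.8 TFLOPS double precision and 125 TFLOPS for reduced-precision tensor operations) are assumed from NVIDIA's published V100 SXM2 specifications and are not stated in the abstract. The CPU contribution is ignored, and the helper name `aggregate_peak` is hypothetical.

```python
# Back-of-envelope check of the peak figures quoted in the abstract.
NODES = 4608          # from the abstract
GPUS_PER_NODE = 6     # from the abstract

# Assumed per-GPU peak rates (V100 SXM2 datasheet values, not from the abstract).
V100_FP64_TFLOPS = 7.8
V100_TENSOR_TFLOPS = 125.0


def aggregate_peak(per_gpu_tflops, nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Return the aggregate GPU peak in TFLOPS, ignoring the CPUs."""
    return per_gpu_tflops * nodes * gpus_per_node


total_gpus = NODES * GPUS_PER_NODE
fp64_pflops = aggregate_peak(V100_FP64_TFLOPS) / 1e3      # TFLOPS -> PFLOPS
tensor_exaops = aggregate_peak(V100_TENSOR_TFLOPS) / 1e6  # TFLOPS -> ExaOps

print(f"GPUs: {total_gpus}")
print(f"Peak FP64: {fp64_pflops:.1f} PFLOPS")
print(f"Peak reduced precision: {tensor_exaops:.2f} ExaOps")
```

Under these assumptions the GPUs alone account for roughly 216 double-precision petaflops and about 3.5 reduced-precision ExaOps, which is consistent with the "over 200" and "over 3" figures in the abstract.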
Original language | English |
---|---|
Article number | 8851173 |
Journal | IBM Journal of Research and Development |
Volume | 63 |
Issue number | 6 |
DOIs | |
State | Published - Nov 1 2019 |
Funding
We would like to acknowledge the helpful discussions and reviews from S. Vazhkudai and S. Seal. This work was supported by the U.S. Department of Energy's (DOE) Office for Advanced Scientific Computing. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This article has been authored by UT-Battelle, LLC, under Contract DE-AC05-00OR22725 with the U.S. DOE. The U.S. government retains, and the publisher, by accepting the article for publication, acknowledges that the U.S. government retains, a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this article, or allow others to do so, for U.S. government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).