The design, deployment, and evaluation of the CORAL pre-exascale systems

Sudharshan S. Vazhkudai, Bronis R. De Supinski, Arthur S. Bland, Al Geist, James Sexton, Jim Kahle, Christopher J. Zimmer, Scott Atchley, Sarp Oral, Don E. Maxwell, Veronica G. Vergara Larrea, Adam Bertsch, Robin Goldstone, Wayne Joubert, Chris Chambreau, David Appelhans, Robert Blackmore, Ben Casses, George Chochia, Gene Davison, Matthew A. Ezell, Tom Gooding, Elsa Gonsiorowski, Leopold Grinberg, Bill Hanson, Bill Hartner, Ian Karlin, Matthew L. Leininger, Dustin Leverman, Chris Marroquin, Adam Moody, Martin Ohmacht, Ramesh Pankajakshan, Fernando Pizzano, James H. Rogers, Bryan Rosenburg, Drew Schmidt, Mallikarjun Shankar, Feiyi Wang, Py Watson, Bob Walkup, Lance D. Weems, Junqi Yin

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

132 Scopus citations

Abstract

CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve up to an 11X and 79X speedup per node, respectively, over Titan.

Original language: English
Title of host publication: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 661-672
Number of pages: 12
ISBN (Electronic): 9781538683842
State: Published - Jul 2 2018
Event: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 - Dallas, United States
Duration: Nov 11 2018 to Nov 16 2018

Publication series

Name: Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018

Conference

Conference: 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Country/Territory: United States
City: Dallas
Period: 11/11/18 to 11/16/18
