TY - GEN
T1 - The design, deployment, and evaluation of the CORAL pre-exascale systems
AU - Vazhkudai, Sudharshan S.
AU - De Supinski, Bronis R.
AU - Bland, Arthur S.
AU - Geist, Al
AU - Sexton, James
AU - Kahle, Jim
AU - Zimmer, Christopher J.
AU - Atchley, Scott
AU - Oral, Sarp
AU - Maxwell, Don E.
AU - Larrea, Veronica G. Vergara
AU - Bertsch, Adam
AU - Goldstone, Robin
AU - Joubert, Wayne
AU - Chambreau, Chris
AU - Appelhans, David
AU - Blackmore, Robert
AU - Casses, Ben
AU - Chochia, George
AU - Davison, Gene
AU - Ezell, Matthew A.
AU - Gooding, Tom
AU - Gonsiorowski, Elsa
AU - Grinberg, Leopold
AU - Hanson, Bill
AU - Hartner, Bill
AU - Karlin, Ian
AU - Leininger, Matthew L.
AU - Leverman, Dustin
AU - Marroquin, Chris
AU - Moody, Adam
AU - Ohmacht, Martin
AU - Pankajakshan, Ramesh
AU - Pizzano, Fernando
AU - Rogers, James H.
AU - Rosenburg, Bryan
AU - Schmidt, Drew
AU - Shankar, Mallikarjun
AU - Wang, Feiyi
AU - Watson, Py
AU - Walkup, Bob
AU - Weems, Lance D.
AU - Yin, Junqi
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve up to 11X and 79X speedup/node, respectively, over Titan.
AB - CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, the CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve up to 11X and 79X speedup/node, respectively, over Titan.
UR - http://www.scopus.com/inward/record.url?scp=85064105525&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00055
DO - 10.1109/SC.2018.00055
M3 - Conference contribution
AN - SCOPUS:85064105525
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 661
EP - 672
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -