Accelerating DCA++ (Dynamical Cluster Approximation) Scientific Application on the Summit Supercomputer

Giovanni Balduzzi, Arghya Chatterjee, Ying Wai Li, Peter W. Doak, Urs Haehner, Ed F. D'Azevedo, Thomas A. Maier, Thomas Schulthess

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

Optimizing scientific applications on today's accelerator-based high performance computing systems can be challenging, especially when multiple GPUs and CPUs with heterogeneous memories and persistent non-volatile memories are present. An example is Summit, an accelerator-based system at the Oak Ridge Leadership Computing Facility (OLCF) that is rated as the world's fastest supercomputer to date. New strategies are thus needed to expose the parallelism in legacy applications while remaining amenable to efficient mapping onto the underlying architecture. In this paper we discuss our experiences and strategies in porting a scientific application, DCA++, to Summit. DCA++ is a high-performance research application that solves quantum many-body problems with a cutting-edge quantum cluster algorithm, the dynamical cluster approximation. Our strategies aim to synergize the strengths of the different programming models in the code. These include: a) streamlining the interactions between the CPU threads and the GPUs, b) implementing computing kernels on the GPUs and decreasing CPU-GPU memory transfers, c) allowing asynchronous GPU communications, and d) increasing compute intensity by combining linear algebraic operations. Full-scale production runs using all 4600 Summit nodes attained a peak performance of 73.5 PFLOPS with a mixed precision implementation. We observed perfect strong and weak scaling for the quantum Monte Carlo solver in DCA++, while encountering about 2x input/output (I/O) and MPI communication overhead on the time-to-solution for the full machine run. Our hardware-agnostic optimizations are designed to alleviate the communication and I/O challenges observed, while improving the compute intensity and obtaining optimal performance on a complex, hybrid architecture like Summit.
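As a rough illustration of strategy d), the sketch below compares the arithmetic intensity (floating-point operations per matrix or vector element moved) of several independent matrix-vector products with that of a single combined matrix-matrix product. It is not taken from DCA++: the function names, the simple traffic model (every operand read or written exactly once per call, no cache reuse), and the sizes `n` and `k` are all assumptions made for the example.

```python
def intensity_separate(n: int, k: int) -> float:
    """Flops per element moved for k independent (n x n) matrix-vector products.

    Hypothetical cost model: the matrix is re-read for every product.
    """
    flops = 2 * n * n * k            # one multiply and one add per matrix entry, k times
    traffic = k * (n * n + 2 * n)    # matrix plus input and output vector, per product
    return flops / traffic


def intensity_combined(n: int, k: int) -> float:
    """Flops per element moved for one (n x n) by (n x k) matrix-matrix product.

    Hypothetical cost model: the matrix is read once and shared by all k columns.
    """
    flops = 2 * n * n * k            # same arithmetic work as the separate case
    traffic = n * n + 2 * n * k      # matrix once, plus the n x k input and output blocks
    return flops / traffic


if __name__ == "__main__":
    n, k = 1024, 64                  # illustrative sizes, not values from the paper
    print(f"separate products: {intensity_separate(n, k):.2f} flops per element")
    print(f"combined product:  {intensity_combined(n, k):.2f} flops per element")
```

Under this simplified model, with n = 1024 and k = 64, the separate products stay just under 2 flops per element moved, while the combined product reaches roughly 114. The arithmetic work is identical in both cases; only the memory traffic changes, which is why grouping operations tends to suit bandwidth-limited accelerators.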

Original language: English
Title of host publication: Proceedings - 2019 28th International Conference on Parallel Architectures and Compilation Techniques, PACT 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 432-443
Number of pages: 12
ISBN (Electronic): 9781728136134
State: Published - Sep 2019
Event: 28th International Conference on Parallel Architectures and Compilation Techniques, PACT 2019 - Seattle, United States
Duration: Sep 21 2019 → Sep 25 2019

Publication series

Name: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
Volume: 2019-September
ISSN (Print): 1089-795X

Conference

Conference: 28th International Conference on Parallel Architectures and Compilation Techniques, PACT 2019
Country/Territory: United States
City: Seattle
Period: 09/21/19 → 09/25/19

Funding

This manuscript has been co-authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

¶ All three authors have equal contribution to the manuscript. † Ying Wai Li contributed to this work mostly during her previous appointment at Oak Ridge National Laboratory, Oak Ridge 37831, U.S.

The authors would like to thank Oscar Hernandez (ORNL), Jeff Larkin (NVIDIA), Don Maxwell (ORNL), Ronny Brendel (Score-P), and John Mellor-Crummey (HPCToolkit) for their insights during the optimization phase of DCA++. This work was supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, Division of Materials Sciences and Engineering. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Funders and funder numbers
DOE Office of Science User Facility supported: DE-AC05-00OR22725
LLC
Oak
Scientific Discovery
UT-Battelle
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research
Division of Materials Sciences and Engineering

    Keywords

    • CUDA
    • CUDA aware MPI
    • DCA
    • QMC
    • Quantum Monte Carlo
    • Spectrum MPI
    • Summit@OLCF
