Abstract
C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method.
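The abstract's central mechanism is that back-end-specific code is selected through template specializations while user code stays generic. The following stdlib-only C++ sketch illustrates that idea under stated assumptions. The tags `SerialSpace` and `DirectiveSpace` and the `ParallelFor`/`ParallelReduce` templates are hypothetical stand-ins, not Kokkos types; Kokkos users instead call `Kokkos::parallel_for` and `Kokkos::parallel_reduce`. Both specializations run sequentially on the host. The OpenACC directives that a directive-based back end would emit appear only as comments. The user-level kernels are the two minibenchmarks named above, AXPY and DOT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical back-end tags for illustration only; these are NOT Kokkos types.
struct SerialSpace {};     // plain host loop
struct DirectiveSpace {};  // stand-in for a directive-based (OpenACC-style) back end

// Primary templates are declared but never defined, so a back end without a
// specialization is rejected at compile time.
template <class ExecSpace> struct ParallelFor;
template <class ExecSpace> struct ParallelReduce;

template <> struct ParallelFor<SerialSpace> {
  template <class Body>
  static void run(std::size_t n, const Body& body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
  }
};

template <> struct ParallelFor<DirectiveSpace> {
  template <class Body>
  static void run(std::size_t n, const Body& body) {
    // A real OpenACC back end would annotate this loop with a directive such as
    // "#pragma acc parallel loop"; this sketch runs it sequentially on the host.
    for (std::size_t i = 0; i < n; ++i) body(i);
  }
};

template <> struct ParallelReduce<SerialSpace> {
  template <class Body>
  static double run(std::size_t n, const Body& body) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) body(i, sum);
    return sum;
  }
};

template <> struct ParallelReduce<DirectiveSpace> {
  template <class Body>
  static double run(std::size_t n, const Body& body) {
    // Would correspond to e.g. "#pragma acc parallel loop reduction(+:sum)".
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) body(i, sum);
    return sum;
  }
};

// Back-end-agnostic user code: the ExecSpace template argument picks the
// specialization at compile time.
template <class ExecSpace>
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  ParallelFor<ExecSpace>::run(x.size(), [&](std::size_t i) { y[i] += a * x[i]; });
}

template <class ExecSpace>
double dot(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  return ParallelReduce<ExecSpace>::run(
      x.size(), [&](std::size_t i, double& acc) { acc += x[i] * y[i]; });
}
```

Here is a usage example. With `x = {1, 2, 3}` and `y = {4, 5, 6}`, calling `axpy<SerialSpace>(2.0, x, y)` updates `y` to `{6, 9, 12}`. A subsequent call to `dot<DirectiveSpace>(x, y)` returns 60. The back end is a template argument, so switching targets changes only the tag. An unsupported tag then fails at compile time rather than at run time.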
Original language | English |
---|---|
Pages (from-to) | 409-426 |
Number of pages | 18 |
Journal | International Journal of High Performance Computing Applications |
Volume | 38 |
Issue number | 5 |
DOIs | |
State | Published - Sep 2024 |
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research used resources from the Experimental Computing Laboratory and the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy (DOE) under contract DE-AC05-00OR22725. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration. This research was also supported in part by the DOE Office of Science, Office of Advanced Scientific Computing Research, and Scientific Discovery through Advanced Computing program. This research used computational resources of the Pegasus system provided by the Multidisciplinary Cooperative Research Program in the Center for Computational Sciences, University of Tsukuba. This manuscript has been authored by UT-Battelle LLC under contract DE-AC05-00OR22725 with DOE. The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).
Keywords
- C++ metaprogramming
- CUDA
- Kokkos
- OpenACC
- OpenMP target
- parallel programming models