Abstract
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel once per time/algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution at every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12x for 2D stencils and 1.24x for 3D stencils over state-of-the-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of 4.86x on smaller SpMV datasets from SuiteSparse and 1.43x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS.
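The contrast between the host-side loop and the persistent-kernel loop can be illustrated with a small CPU-side analogy. The sketch below is not the PERKS implementation and does not run on a GPU: Python threads stand in for thread blocks, `threading.Barrier` stands in for the device-wide barrier, and the 3-point averaging stencil and all function names (`step_range`, `host_loop`, `persistent`) are hypothetical choices made for illustration. It also does not model the caching of output in registers and shared memory.

```python
import threading

def step_range(src, dst, lo, hi):
    """Apply a 3-point averaging stencil to src[lo:hi], writing into dst."""
    n = len(src)
    for i in range(lo, hi):
        left = src[max(i - 1, 0)]        # clamp at the left boundary
        right = src[min(i + 1, n - 1)]   # clamp at the right boundary
        dst[i] = (left + src[i] + right) / 3.0

def host_loop(u0, steps):
    """Baseline: one 'kernel launch' per time step.
    Returning from step_range plays the role of the implicit barrier."""
    a, b = list(u0), [0.0] * len(u0)
    for _ in range(steps):
        step_range(a, b, 0, len(a))
        a, b = b, a
    return a

def persistent(u0, steps, workers=4):
    """Persistent variant: each worker runs the whole time loop itself
    and synchronizes with the others at an explicit barrier."""
    n = len(u0)
    bufs = [list(u0), [0.0] * n]         # double buffer shared by all workers
    barrier = threading.Barrier(workers)
    chunk = (n + workers - 1) // workers

    def worker(w):
        lo, hi = w * chunk, min((w + 1) * chunk, n)
        for t in range(steps):
            step_range(bufs[t % 2], bufs[(t + 1) % 2], lo, hi)
            barrier.wait()               # stand-in for the device-wide barrier

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return bufs[steps % 2]
```

Under these assumptions both variants perform the same per-element arithmetic, so `persistent(u0, steps)` should return the same values as `host_loop(u0, steps)`; the difference is only where the time loop and the synchronization live.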
| Original language | English |
|---|---|
| Title of host publication | ACM ICS 2023 - Proceedings of the International Conference on Supercomputing |
| Publisher | Association for Computing Machinery |
| Pages | 167-179 |
| Number of pages | 13 |
| ISBN (Electronic) | 9798400700569 |
| DOIs | |
| State | Published - Jun 21 2023 |
| Event | 37th ACM International Conference on Supercomputing, ICS 2023 - Orlando, United States. Duration: Jun 21 2023 → Jun 23 2023 |
Publication series
| Name | Proceedings of the International Conference on Supercomputing |
|---|
Conference
| Conference | 37th ACM International Conference on Supercomputing, ICS 2023 |
|---|---|
| Country/Territory | United States |
| City | Orlando |
| Period | 06/21/23 → 06/23/23 |
Funding
This work was supported by JSPS KAKENHI under Grant Numbers JP22H03600 and JP21K17750. This work was supported by JST, PRESTO Grant Number JPMJPR20MA, Japan. This paper is based on results obtained from the JPNP20006 project, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan/). The authors wish to express their sincere appreciation to Jens Domke, Aleksandr Drozd, Emil Vatai and other RIKEN R-CCS colleagues for their invaluable advice and guidance throughout the course of this research. They also wish to thank Dr. Zhao Tuowen from SambaNova for the helpful discussions. Finally, the first author would also like to express his gratitude to RIKEN R-CCS for offering the opportunity to undertake this research in an intern program.
Keywords
- GPU
- iterative solvers
- memory-bound
- persistent kernel