Abstract
Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Much effort has gone into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. Temporal blocking is an optimization that improves data locality by combining a batch of time steps and processing them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in recent NVIDIA GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups of up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.
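To make the idea of temporal blocking concrete, the following is a minimal sketch in plain Python, not the EBISU method or any of the compared libraries. It assumes a 1D 3-point averaging stencil with fixed end points and uses overlapped tiling with redundant halo computation, which is only one classic form of temporal blocking. The function names (`step`, `naive`, `temporally_blocked`) and the `tile` parameter are hypothetical and chosen for illustration.

```python
def step(u):
    """One 3-point averaging step; the two end points are held fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def naive(u, steps):
    """Reference version: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporally_blocked(u, steps, tile=8):
    """Advance each tile by `steps` time steps before moving to the next tile.

    Each tile is loaded together with a halo of `steps` cells per side
    (the stencil radius is 1). Every local step invalidates one more cell
    at each interior edge of the local buffer, so after `steps` steps only
    the halo is stale and the tile itself is exact.
    """
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # clip the halo at the domain boundary
        hi = min(end + steps, n)
        local = u[lo:hi]             # stands in for a fast on-chip buffer
        for _ in range(steps):
            local = step(local)
        out[start:end] = local[start - lo:end - lo]
    return out


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(40)]
    ref = naive(data, 4)
    blk = temporally_blocked(data, 4, tile=8)
    print(max(abs(a - b) for a, b in zip(ref, blk)))
```

In this sketch the main array is read and written once per tile instead of once per time step, at the cost of recomputing the halo cells. That trade-off between data movement and redundant work is the general motivation behind temporal blocking.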
| Original language | English |
|---|---|
| Title of host publication | ACM ICS 2023 - Proceedings of the International Conference on Supercomputing |
| Publisher | Association for Computing Machinery |
| Pages | 251-263 |
| Number of pages | 13 |
| ISBN (Electronic) | 9798400700569 |
| State | Published - Jun 21 2023 |
| Event | 37th ACM International Conference on Supercomputing, ICS 2023 - Orlando, United States. Duration: Jun 21 2023 → Jun 23 2023 |
Publication series
| Name | Proceedings of the International Conference on Supercomputing |
|---|---|
Conference
| Conference | 37th ACM International Conference on Supercomputing, ICS 2023 |
|---|---|
| Country/Territory | United States |
| City | Orlando |
| Period | 06/21/23 → 06/23/23 |
Funding
This work was supported by JSPS KAKENHI under Grant Numbers JP22H03600 and JP21K17750. This work was supported by JST, PRESTO Grant Number JPMJPR20MA, Japan. This paper is based on results obtained from the JPNP20006 project, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan/). The authors wish to express their sincere appreciation to Jens Domke, Aleksandr Drozd, Emil Vatai, and other RIKEN R-CCS colleagues for their invaluable advice and guidance throughout the course of this research. Finally, the first author would also like to express his gratitude to RIKEN R-CCS for offering the opportunity to undertake this research through an internship program.
Keywords
- GPU
- stencil
- temporal blocking optimizations