Abstract
Integer sum reduction is a primitive operation commonly used in scientific computing. Implementing a parallel reduction on a GPU often involves concurrent memory accesses using atomic operations and synchronization of work-items in a work-group. To better understand these operations, we redesigned micro-kernels in the HIP programming language to measure the time of atomic operations over global memory, the cost of barrier synchronization, and the cost of reduction within a work-group to shared local memory using one atomic addition per work-item on a compute unit in an AMD MI100 GPU. We then describe implementations of the reduction kernels with vectorized memory accesses, parameterized workload sizes, and the vendor's library APIs. Our experimental results show that 1) increasing the size of a work-group trades the cost of barrier synchronization against the amount of parallelism from atomic operations over shared local memory; 2) a reduction kernel with vectorized memory accesses and vector data types is approximately 3% faster for the large problem size than the kernels written with the vendor's library APIs; 3) the compiler needs to assist the hardware processor with data dependency resolution at the instruction set architecture level; and 4) the power consumption of the kernel execution on the GPU fluctuates between 277 Watts and 301 Watts, and the dynamic power of other GPU activities is at most 31 Watts.
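The work-group reduction pattern mentioned in the abstract (one atomic addition per work-item into shared local memory, followed by a barrier) can be sketched on the host as follows. This is a sequential C++ emulation written only for illustration, not the authors' HIP kernels: `reduce_sum` and `group_size` are hypothetical names, and the final per-group atomic addition to a global accumulator is an assumption about how partial sums might be combined.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side, sequential emulation of a two-level reduction; not HIP code.
// group_size is a hypothetical stand-in for the work-group size.
long long reduce_sum(const std::vector<int>& data, std::size_t group_size) {
    assert(group_size > 0);
    std::atomic<long long> global_sum{0};  // stands in for global memory

    // Each outer iteration plays the role of one work-group.
    for (std::size_t base = 0; base < data.size(); base += group_size) {
        std::atomic<long long> local_sum{0};  // stands in for shared local memory
        const std::size_t end = std::min(base + group_size, data.size());

        // Each inner iteration plays the role of one work-item,
        // which issues exactly one atomic addition.
        for (std::size_t i = base; i < end; ++i) {
            local_sum.fetch_add(data[i]);
        }

        // On a GPU, a work-group barrier would be needed here so that all
        // work-items finish before the partial sum is read. The sequential
        // loop above makes that ordering implicit in this emulation.

        // Assumption: one atomic addition per work-group to the global sum.
        global_sum.fetch_add(local_sum.load());
    }
    return global_sum.load();
}
```

Calling `reduce_sum` with different `group_size` values returns the same sum; on real hardware the work-group size would instead affect the balance between barrier cost and atomic parallelism that the abstract describes.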
Original language | English |
---|---|
Title of host publication | 51st International Conference on Parallel Processing, ICPP 2022 - Workshop Proceedings |
Publisher | Association for Computing Machinery |
ISBN (Electronic) | 9781450394451 |
State | Published - Aug 29 2022 |
Event | 51st International Conference on Parallel Processing, ICPP 2022 - Virtual, Online, France. Duration: Aug 29 2022 → Sep 1 2022 |
Publication series
Name | ACM International Conference Proceeding Series |
---|---|
Conference
Conference | 51st International Conference on Parallel Processing, ICPP 2022 |
---|---|
Country/Territory | France |
City | Virtual, Online |
Period | 08/29/22 → 09/01/22 |
Funding
We thank the reviewers for their criticisms. This research used resources of the Experimental Computing Lab at ORNL. This research was supported by the US Department of Energy Advanced Scientific Computing Research program under Contract No. DE-AC05-00OR22725.
Keywords
- GPU
- Parallel reduction
- Programming language