TY - GEN
T1 - A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
AU - Jin, Zheming
AU - Vetter, Jeffrey
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/8/29
Y1 - 2022/8/29
N2 - Integer sum reduction is a primitive operation commonly used in scientific computing. Implementing a parallel reduction on a GPU often involves concurrent memory accesses using atomic operations and synchronization of work-items in a work-group. For a better understanding of these operations, we redesigned micro-kernels in the HIP programming language to measure the time of atomic operations over global memory, the cost of barrier synchronization, and reduction within a work-group to shared local memory using one atomic addition per work-item on a compute unit in an AMD MI100 GPU. Then, we describe the implementations of the reduction kernels with vectorized memory accesses, parameterized workload sizes, and the vendor's library APIs. Our experimental results show that 1) there is a performance tradeoff between the cost of barrier synchronization and the amount of parallelism from atomic operations over shared local memory when we increase the size of a work-group; 2) a reduction kernel with vectorized memory accesses and vector data types is approximately 3% faster for the large problem size than the kernels written with the vendor's library APIs; 3) the compiler needs to assist the hardware processor with data dependency resolution at the level of the instruction set architecture; and 4) the power consumption of the kernel execution on the GPU fluctuates between 277 watts and 301 watts, and the dynamic power of other GPU activities is at most 31 watts.
KW - GPU
KW - Parallel reduction
KW - Programming language
UR - http://www.scopus.com/inward/record.url?scp=85147417367&partnerID=8YFLogxK
U2 - 10.1145/3547276.3548627
DO - 10.1145/3547276.3548627
M3 - Conference contribution
AN - SCOPUS:85147417367
T3 - ACM International Conference Proceeding Series
BT - 51st International Conference on Parallel Processing, ICPP 2022 - Workshop Proceedings
PB - Association for Computing Machinery
T2 - 51st International Conference on Parallel Processing, ICPP 2022
Y2 - 29 August 2022 through 1 September 2022
ER -