TY - GEN
T1 - Exploring integer sum reduction using atomics on Intel CPU
AU - Jin, Zheming
AU - Finkel, Hal
N1 - Publisher Copyright:
© 2019 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/5/13
Y1 - 2019/5/13
N2 - Atomic functions are useful for updating a shared variable from multiple threads, implementing barrier synchronization, constructing complex data structures, and building high-level frameworks. In this paper, we focus on the evaluation and analysis of integer sum reduction, a common data-parallel primitive. We convert the sequential reduction into parallel OpenCL implementations on a CPU. To understand the relationships between kernel performance and the operations involved in reduction, we develop three micro-kernels that show the costs of one atomic addition to global memory from one work-item per work-group, a work-group barrier, and reducing within a work-group to local memory using one atomic addition per work-item. The sum reduction kernel with vectorized memory accesses can improve the performance of the baseline kernel for a wide range of work-group sizes. However, the vectorization efficiency shrinks with growing work-group size. We also find that the vendor’s default OpenCL kernel optimization does not improve kernel performance. When the vectorization width is 16, the speedup of our manual vectorization over the vendor’s auto-vectorization ranges from 1.03 to 16.7. We attribute the performance drop to the fact that the default kernel optimizations instantiate a large number of atomic operations.
KW - Atomics
KW - CPU
KW - Integer sum reduction
KW - OpenCL
KW - Vectorization
UR - http://www.scopus.com/inward/record.url?scp=85069153543&partnerID=8YFLogxK
U2 - 10.1145/3318170.3318178
DO - 10.1145/3318170.3318178
M3 - Conference contribution
AN - SCOPUS:85069153543
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the International Workshop on OpenCL, IWOCL 2019
PB - Association for Computing Machinery
T2 - 2019 International Workshop on OpenCL, IWOCL 2019
Y2 - 13 May 2019 through 15 May 2019
ER -