TY - GEN
T1 - Population count on Intel® CPU, GPU and FPGA
AU - Jin, Zheming
AU - Finkel, Hal
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
AB - Population count is a primitive used in many applications. Commodity processors have dedicated instructions for achieving high-performance population count. Motivated by the productivity of high-level synthesis and the importance of population count, in this paper we investigated OpenCL implementations of population count algorithms and evaluated their performance and resource utilization on an FPGA. Based on the results, we selected the most efficient implementation. We then derived a reduction pattern from a representative application of population count. We parallelized the reduction with atomic functions and optimized it with vectorized memory accesses, tree reduction, and compute-unit duplication. We evaluated the performance of the reduction kernel on an Intel® Xeon® CPU, an Intel® Iris™ Pro integrated GPU, and an FPGA card that features an Intel® Arria® 10 FPGA. When DRAM memory bandwidth is comparable on the three computing platforms, the FPGA can achieve the highest kernel performance for large workloads. However, we also described performance bottlenecks on the FPGA. To make FPGAs more competitive in raw performance with high-performance CPU and GPU platforms, it is important to increase external memory bandwidth, minimize data movement between a host and a device, and reduce OpenCL runtime overhead on an FPGA.
KW - Heterogeneous computing
KW - OpenCL
KW - Population count
UR - http://www.scopus.com/inward/record.url?scp=85091574754&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW50202.2020.00081
DO - 10.1109/IPDPSW50202.2020.00081
M3 - Conference contribution
AN - SCOPUS:85091574754
T3 - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
SP - 432
EP - 439
BT - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 34th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
Y2 - 18 May 2020 through 22 May 2020
ER -