TY - GEN
T1 - Optimizing parallel reduction on OpenCL FPGA platform - A case study of frequent pattern compression
AU - Jin, Zheming
AU - Finkel, Hal
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/8/3
Y1 - 2018/8/3
N2 - Field-programmable gate arrays (FPGAs) are becoming a promising heterogeneous computing component in high-performance computing. To facilitate the usage of FPGAs for developers and researchers, high-level synthesis tools are pushing the FPGA-based design abstraction from the register-transfer level to high-level language design flow using OpenCL/C/C++. Currently, there are few studies on parallel reduction using atomic functions in the OpenCL-based design flow on an FPGA. Inspired by the reduction operation in frequent pattern compression, we transform the function into an OpenCL kernel, and describe the optimizations of the kernel on an Arria 10-based FPGA platform as a case study. We found that automatic kernel vectorization does not improve the kernel performance. Users can manually vectorize the kernel to achieve performance speedup. Overall, our optimizations improve the kernel performance by a factor of 11.9 over the baseline kernel. The performance per watt of the kernel on an Intel Arria 10 GX1150 FPGA is 5.3X higher than an Intel Xeon 16-core CPU while 0.625X lower than an Nvidia K80 GPU.
AB - Field-programmable gate arrays (FPGAs) are becoming a promising heterogeneous computing component in high-performance computing. To facilitate the usage of FPGAs for developers and researchers, high-level synthesis tools are pushing the FPGA-based design abstraction from the register-transfer level to high-level language design flow using OpenCL/C/C++. Currently, there are few studies on parallel reduction using atomic functions in the OpenCL-based design flow on an FPGA. Inspired by the reduction operation in frequent pattern compression, we transform the function into an OpenCL kernel, and describe the optimizations of the kernel on an Arria 10-based FPGA platform as a case study. We found that automatic kernel vectorization does not improve the kernel performance. Users can manually vectorize the kernel to achieve performance speedup. Overall, our optimizations improve the kernel performance by a factor of 11.9 over the baseline kernel. The performance per watt of the kernel on an Intel Arria 10 GX1150 FPGA is 5.3X higher than an Intel Xeon 16-core CPU while 0.625X lower than an Nvidia K80 GPU.
KW - Atomics
KW - FPGA
KW - OpenCL
KW - Reductions
UR - http://www.scopus.com/inward/record.url?scp=85052205171&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2018.00015
DO - 10.1109/IPDPSW.2018.00015
M3 - Conference contribution
AN - SCOPUS:85052205171
SN - 9781538655559
T3 - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
SP - 27
EP - 35
BT - Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Y2 - 21 May 2018 through 25 May 2018
ER -