TY - GEN
T1 - Performance-oriented optimizations for OpenCL streaming kernels on the FPGA
AU - Jin, Zheming
AU - Finkel, Hal
N1 - Publisher Copyright:
© 2018 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2018/5/14
Y1 - 2018/5/14
N2 - Field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people with little hardware design knowledge to evaluate an application on FPGAs. There is therefore a need to understand where OpenCL and FPGAs fit in the streaming domain. To this end, we explore the implementation space and discuss techniques for optimizing the performance of streaming kernels using the Intel OpenCL SDK for FPGA. On the Nallatech 385A FPGA platform, which features an Arria 10 GX1150 FPGA, the experimental results show that FPGA resources, such as block RAMs and DSPs, can limit the performance of a kernel before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2.8 to 10. Combining the two techniques can improve performance by a factor of 3.3 to 16, achieving the highest performance. To improve the performance of streaming kernels with compute unit duplication, the local work size needs to be tuned. Relative to an untuned duplicated kernel, the optimal value can increase performance by a factor of 3 to 70.
AB - Field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people with little hardware design knowledge to evaluate an application on FPGAs. There is therefore a need to understand where OpenCL and FPGAs fit in the streaming domain. To this end, we explore the implementation space and discuss techniques for optimizing the performance of streaming kernels using the Intel OpenCL SDK for FPGA. On the Nallatech 385A FPGA platform, which features an Arria 10 GX1150 FPGA, the experimental results show that FPGA resources, such as block RAMs and DSPs, can limit the performance of a kernel before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2.8 to 10. Combining the two techniques can improve performance by a factor of 3.3 to 16, achieving the highest performance. To improve the performance of streaming kernels with compute unit duplication, the local work size needs to be tuned. Relative to an untuned duplicated kernel, the optimal value can increase performance by a factor of 3 to 70.
KW - FPGA
KW - OpenCL
KW - Streaming kernels
UR - http://www.scopus.com/inward/record.url?scp=85048775522&partnerID=8YFLogxK
U2 - 10.1145/3204919.3204920
DO - 10.1145/3204919.3204920
M3 - Conference contribution
AN - SCOPUS:85048775522
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the International Workshop on OpenCL, IWOCL 2018
PB - Association for Computing Machinery
T2 - 6th International Workshop on OpenCL, IWOCL 2018
Y2 - 14 May 2018 through 16 May 2018
ER -