TY - GEN
T1 - Base64 encoding on heterogeneous computing platforms
AU - Jin, Zheming
AU - Finkel, Hal
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - Base64 encoding has many applications on the Web. Previous studies investigated optimizations of the Base64 encoding algorithm on central processing units (CPUs). In this paper, we describe optimizations of the algorithm on heterogeneous computing platforms. More specifically, we explain the algorithm, convert it to kernels written in CUDA C/C++ and Open Computing Language (OpenCL), optimize the CUDA and OpenCL applications with CUDA and OpenCL streams, which can overlap data transfers with kernel computations, and vectorize the CUDA and OpenCL kernels to improve kernel throughput. We evaluate the impact of the number of streams on kernel performance on an NVIDIA Pascal P100 graphics processing unit (GPU) and a Nallatech 385A card that features an Intel Arria 10 GX1150 field-programmable gate array (FPGA). We also measure the performance and power of the applications on the CPU, GPU, and FPGA to understand the advantage of each platform and the benefit of kernel offloading. The experiments show that using vector data types in the kernels does not improve performance on the GPU, and that using more work-items is better than using large vectors per work-item. OpenCL and CUDA streams achieve almost the same performance on the GPU, but streams should be used with caution when GPU resources are underutilized. On the FPGA, kernel vectorization using 16 vector lanes achieves the highest performance when the number of streams is one. However, increasing the vector width per work-item and the number of streams decreases the kernel computation time for each stream, thereby reducing the number of concurrent operations across the streams. While the raw performance on the GPU is 3.1X higher than that on the FPGA, the FPGA consumes 3.4X less power. A comparison with a state-of-the-art implementation on an Intel CPU server shows an increasing benefit of kernel offloading.
AB - Base64 encoding has many applications on the Web. Previous studies investigated optimizations of the Base64 encoding algorithm on central processing units (CPUs). In this paper, we describe optimizations of the algorithm on heterogeneous computing platforms. More specifically, we explain the algorithm, convert it to kernels written in CUDA C/C++ and Open Computing Language (OpenCL), optimize the CUDA and OpenCL applications with CUDA and OpenCL streams, which can overlap data transfers with kernel computations, and vectorize the CUDA and OpenCL kernels to improve kernel throughput. We evaluate the impact of the number of streams on kernel performance on an NVIDIA Pascal P100 graphics processing unit (GPU) and a Nallatech 385A card that features an Intel Arria 10 GX1150 field-programmable gate array (FPGA). We also measure the performance and power of the applications on the CPU, GPU, and FPGA to understand the advantage of each platform and the benefit of kernel offloading. The experiments show that using vector data types in the kernels does not improve performance on the GPU, and that using more work-items is better than using large vectors per work-item. OpenCL and CUDA streams achieve almost the same performance on the GPU, but streams should be used with caution when GPU resources are underutilized. On the FPGA, kernel vectorization using 16 vector lanes achieves the highest performance when the number of streams is one. However, increasing the vector width per work-item and the number of streams decreases the kernel computation time for each stream, thereby reducing the number of concurrent operations across the streams. While the raw performance on the GPU is 3.1X higher than that on the FPGA, the FPGA consumes 3.4X less power. A comparison with a state-of-the-art implementation on an Intel CPU server shows an increasing benefit of kernel offloading.
KW - Base64 encoding
KW - CUDA
KW - FPGA
KW - GPU
KW - Heterogeneous computing
KW - OpenCL
KW - Stream
UR - http://www.scopus.com/inward/record.url?scp=85072616819&partnerID=8YFLogxK
U2 - 10.1109/ASAP.2019.00014
DO - 10.1109/ASAP.2019.00014
M3 - Conference contribution
AN - SCOPUS:85072616819
T3 - Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors
SP - 247
EP - 254
BT - Proceedings - 2019 IEEE 30th International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2019
Y2 - 15 July 2019 through 17 July 2019
ER -