TY - BOOK
T1 - Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL
AU - Jin, Zheming
PY - 2024/7
Y1 - 2024/7
N2 - SYCL is a portable programming model for heterogeneous computing, so achieving reasonable performance portability with SYCL is important. Toward the goal of better understanding and improving the performance portability of SYCL for machine learning workloads, we have been developing benchmarks for basic operators in deep neural networks (DNNs). These operators can be offloaded to heterogeneous computing devices such as graphics processing units (GPUs) to speed up computation. In this work, we introduce the benchmarks, evaluate the performance of the operators on GPU-based systems, and describe the causes of the performance gap between the SYCL and Compute Unified Device Architecture (CUDA) kernels. We find that the causes are related to the use of the texture cache for read-only data, the optimization of memory accesses through strength reduction, shared local memory accesses, and register usage per thread. We hope that these efforts to develop benchmarks for studying performance portability will stimulate discussion and interaction within the community.
KW - 97 MATHEMATICS AND COMPUTING
KW - Performance Portability
KW - Benchmarks
KW - DNN operators
U2 - 10.2172/2404613
DO - 10.2172/2404613
M3 - Commissioned report
BT - Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL
CY - United States
ER -