TY - GEN
T1 - Performance-portable autotuning of OpenCL kernels for convolutional layers of deep neural networks
AU - Tsai, Yaohung M.
AU - Luszczek, Piotr
AU - Kurzak, Jakub
AU - Dongarra, Jack
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/1/27
Y1 - 2017/1/27
N2 - We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach is a novel combination of existing HPC techniques that methodically applies autotuning as well as data-layout and low-level optimizations, achieving performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries; the former was done inside the maxDNN implementation, and the latter is represented by cuDNN. Our work may be directly applied to the most time-consuming part of the DNN workflow, namely the training process, which often needs a restart when it stagnates due to, for example, diminishing gradients and getting stuck in local minima. Performance tests on a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack show that our methodology can match server-grade hardware at a fraction of the price. An additional tuning sweep on a new GPU architecture from a different vendor attests to the portability of our approach and the quality of our implementation.
AB - We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach is a novel combination of existing HPC techniques that methodically applies autotuning as well as data-layout and low-level optimizations, achieving performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries; the former was done inside the maxDNN implementation, and the latter is represented by cuDNN. Our work may be directly applied to the most time-consuming part of the DNN workflow, namely the training process, which often needs a restart when it stagnates due to, for example, diminishing gradients and getting stuck in local minima. Performance tests on a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack show that our methodology can match server-grade hardware at a fraction of the price. An additional tuning sweep on a new GPU architecture from a different vendor attests to the portability of our approach and the quality of our implementation.
UR - http://www.scopus.com/inward/record.url?scp=85015243309&partnerID=8YFLogxK
U2 - 10.1109/MLHPC.2016.5
DO - 10.1109/MLHPC.2016.5
M3 - Conference contribution
AN - SCOPUS:85015243309
T3 - Proceedings of MLHPC 2016: Machine Learning in HPC Environments - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 9
EP - 18
BT - Proceedings of MLHPC 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2016 Machine Learning in HPC Environments, MLHPC 2016
Y2 - 14 November 2016
ER -