TY - GEN
T1 - Fine-grained exploitation of mixed precision for faster CNN training
AU - Johnston, Jeremy T.
AU - Young, Steven R.
AU - Schuman, Catherine D.
AU - Chae, Junghoon
AU - March, Don D.
AU - Patton, Robert M.
AU - Potok, Thomas E.
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - As deep convolutional neural networks (CNNs) have become increasingly popular and successful at an ever-widening number of machine learning tasks, specialized hardware has become increasingly available for training and deploying them. NVIDIA's recent Volta architecture includes tensor cores, which perform a fused operation in reduced and mixed precision (16-bit multiply, 32-bit accumulate). Recent research indicates that, typically, very little is lost (in terms of training accuracy) when half precision is used in place of single precision, and performance gains can be made by doing arithmetic in reduced precision. In this work we demonstrate that making layer-by-layer choices as to the arithmetic/data precision can lead to further performance improvement. In our study of 25,200 CNNs we demonstrate an average speedup (over purely half precision) of 1.27x and speedups as high as 3.64x by appropriately combining single and half precision arithmetic and data types on a layer-by-layer basis.
AB - As deep convolutional neural networks (CNNs) have become increasingly popular and successful at an ever-widening number of machine learning tasks, specialized hardware has become increasingly available for training and deploying them. NVIDIA's recent Volta architecture includes tensor cores, which perform a fused operation in reduced and mixed precision (16-bit multiply, 32-bit accumulate). Recent research indicates that, typically, very little is lost (in terms of training accuracy) when half precision is used in place of single precision, and performance gains can be made by doing arithmetic in reduced precision. In this work we demonstrate that making layer-by-layer choices as to the arithmetic/data precision can lead to further performance improvement. In our study of 25,200 CNNs we demonstrate an average speedup (over purely half precision) of 1.27x and speedups as high as 3.64x by appropriately combining single and half precision arithmetic and data types on a layer-by-layer basis.
UR - http://www.scopus.com/inward/record.url?scp=85078895413&partnerID=8YFLogxK
U2 - 10.1109/MLHPC49564.2019.00007
DO - 10.1109/MLHPC49564.2019.00007
M3 - Conference contribution
AN - SCOPUS:85078895413
T3 - Proceedings of MLHPC 2019: 5th Workshop on Machine Learning in HPC Environments - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 9
EP - 18
BT - Proceedings of MLHPC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE/ACM Workshop on Machine Learning in HPC Environments, MLHPC 2019
Y2 - 18 November 2019
ER -