TY - GEN
T1 - Communication-Efficient Parallelization Strategy for Deep Convolutional Neural Network Training
AU - Lee, Sunwoo
AU - Agrawal, Ankit
AU - Balaprakash, Prasanna
AU - Choudhary, Alok
AU - Liao, Wei Keng
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - Training Convolutional Neural Network (CNN) models is extremely time-consuming, and the efficiency of its parallelization plays a key role in finishing the training in a reasonable amount of time. The well-known synchronous Stochastic Gradient Descent (SGD) algorithm suffers from the high cost of inter-process communication and synchronization. To address these problems, the asynchronous SGD algorithm employs a master-slave model for parameter updates. However, it can result in a poor convergence rate due to gradient staleness. In addition, the master-slave model does not scale to a large number of compute nodes. In this paper, we present a communication-efficient gradient averaging algorithm for synchronous SGD, which adopts a few design strategies to maximize the degree of overlap between computation and communication. A time complexity analysis shows that our algorithm outperforms the traditional allreduce-based algorithm. By training two popular deep CNN models, VGG-16 and ResNet-50, on the ImageNet dataset, our experiments performed on Cori Phase-I, a Cray XC40 supercomputer at NERSC, show that our algorithm achieves a 2516.36× speedup for VGG-16 and a 2734.25× speedup for ResNet-50 using up to 8192 cores.
AB - Training Convolutional Neural Network (CNN) models is extremely time-consuming, and the efficiency of its parallelization plays a key role in finishing the training in a reasonable amount of time. The well-known synchronous Stochastic Gradient Descent (SGD) algorithm suffers from the high cost of inter-process communication and synchronization. To address these problems, the asynchronous SGD algorithm employs a master-slave model for parameter updates. However, it can result in a poor convergence rate due to gradient staleness. In addition, the master-slave model does not scale to a large number of compute nodes. In this paper, we present a communication-efficient gradient averaging algorithm for synchronous SGD, which adopts a few design strategies to maximize the degree of overlap between computation and communication. A time complexity analysis shows that our algorithm outperforms the traditional allreduce-based algorithm. By training two popular deep CNN models, VGG-16 and ResNet-50, on the ImageNet dataset, our experiments performed on Cori Phase-I, a Cray XC40 supercomputer at NERSC, show that our algorithm achieves a 2516.36× speedup for VGG-16 and a 2734.25× speedup for ResNet-50 using up to 8192 cores.
KW - Convolutional Neural Network
KW - Deep Learning
KW - Distributed-Memory Parallelization
KW - Parallelization
UR - http://www.scopus.com/inward/record.url?scp=85063063463&partnerID=8YFLogxK
U2 - 10.1109/MLHPC.2018.8638635
DO - 10.1109/MLHPC.2018.8638635
M3 - Conference contribution
AN - SCOPUS:85063063463
T3 - Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 47
EP - 56
BT - Proceedings of MLHPC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018
Y2 - 12 November 2018
ER -