TY - GEN
T1 - An Efficient, Distributed Stochastic Gradient Descent Algorithm for Deep-Learning Applications
AU - Cong, Guojing
AU - Bhardwaj, Onkar
AU - Feng, Minwei
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/1
Y1 - 2017/9/1
N2 - Parallel and distributed processing is employed to accelerate training for many deep-learning applications with large models and inputs. Asynchronous stochastic gradient descent (ASGD), derived from stochastic gradient descent (SGD), is widely used because it reduces synchronization and communication overhead by tolerating stale gradient updates. Recent theoretical analyses show that ASGD converges with linear asymptotic speedup over SGD. Often glossed over in these analyses, however, are the communication overhead and practical learning rates that are critical to ASGD performance. After analyzing the communication performance and convergence behavior of ASGD, using the Downpour algorithm as an example, we demonstrate the challenges ASGD faces in achieving good practical speedup over SGD. We propose a distributed, bulk-synchronous stochastic gradient descent algorithm that allows sparse gradient aggregation from individual learners. The communication cost is amortized explicitly by a gradient aggregation interval, and global reductions are used instead of a parameter server for gradient aggregation. We prove its convergence and show that it has superior communication performance and convergence behavior over popular ASGD implementations such as Downpour and EAMSGD for deep-learning applications.
AB - Parallel and distributed processing is employed to accelerate training for many deep-learning applications with large models and inputs. Asynchronous stochastic gradient descent (ASGD), derived from stochastic gradient descent (SGD), is widely used because it reduces synchronization and communication overhead by tolerating stale gradient updates. Recent theoretical analyses show that ASGD converges with linear asymptotic speedup over SGD. Often glossed over in these analyses, however, are the communication overhead and practical learning rates that are critical to ASGD performance. After analyzing the communication performance and convergence behavior of ASGD, using the Downpour algorithm as an example, we demonstrate the challenges ASGD faces in achieving good practical speedup over SGD. We propose a distributed, bulk-synchronous stochastic gradient descent algorithm that allows sparse gradient aggregation from individual learners. The communication cost is amortized explicitly by a gradient aggregation interval, and global reductions are used instead of a parameter server for gradient aggregation. We prove its convergence and show that it has superior communication performance and convergence behavior over popular ASGD implementations such as Downpour and EAMSGD for deep-learning applications.
KW - Deep learning
KW - Distributed processing
KW - Stochastic gradient descent
UR - http://www.scopus.com/inward/record.url?scp=85030626279&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2017.10
DO - 10.1109/ICPP.2017.10
M3 - Conference contribution
AN - SCOPUS:85030626279
T3 - Proceedings of the International Conference on Parallel Processing
SP - 11
EP - 20
BT - Proceedings - 46th International Conference on Parallel Processing, ICPP 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 46th International Conference on Parallel Processing, ICPP 2017
Y2 - 14 August 2017 through 17 August 2017
ER -