TY - GEN
T1 - Accelerating deep neural network learning for speech recognition on a cluster of GPUs
AU - Cong, Guojing
AU - Kingsbury, Brian
AU - Gosh, Soumyadip
AU - Saon, George
AU - Zhou, Fan
N1 - Publisher Copyright: © 2017 Copyright held by the owner/author(s).
PY - 2017/11/12
Y1 - 2017/11/12
N2 - We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition. We employ distributed processing using a cluster of GPUs. On modern GPUs, the sequential implementation takes over a day to train, and efficient parallelization without losing accuracy is notoriously hard. We show that ASGD methods for parallelization are not efficient for this application. Even with 4 GPUs, the overhead is significant, and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that with 4 GPUs achieves accuracies comparable to that achieved by the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately our parallel implementation achieves better accuracies than the sequential implementation with a 6.1 times speedup.
AB - We train deep neural networks to solve the acoustic modeling problem for large-vocabulary continuous speech recognition. We employ distributed processing using a cluster of GPUs. On modern GPUs, the sequential implementation takes over a day to train, and efficient parallelization without losing accuracy is notoriously hard. We show that ASGD methods for parallelization are not efficient for this application. Even with 4 GPUs, the overhead is significant, and the accuracies achieved are poor. We adapt a P-learner K-step model averaging algorithm that with 4 GPUs achieves accuracies comparable to that achieved by the sequential implementation. We further introduce adaptive measures that make our parallel implementation scale to the full cluster of 20 GPUs. Ultimately our parallel implementation achieves better accuracies than the sequential implementation with a 6.1 times speedup.
UR - http://www.scopus.com/inward/record.url?scp=85047198230&partnerID=8YFLogxK
U2 - 10.1145/3146347.3146351
DO - 10.1145/3146347.3146351
M3 - Conference contribution
AN - SCOPUS:85047198230
T3 - Proceedings of MLHPC 2017: Machine Learning in HPC Environments - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of MLHPC 2017
PB - Association for Computing Machinery, Inc
T2 - 2017 Machine Learning in HPC Environments, MLHPC 2017
Y2 - 12 November 2017 through 17 November 2017
ER -