TY - GEN
T1 - Accelerating Distributed ML Training via Selective Synchronization (Poster Abstract)
AU - Tyagi, Sahil
AU - Swany, Martin
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - In bulk-synchronous parallel (BSP), or simply synchronous, training, deep neural networks (DNNs) are launched across multiple workers concurrently, and the workers aggregate their local updates either through a parameter server (PS) [1] or via decentralized AllReduce [2]. Thus, the aggregation step on every iteration is blocking, i.e., all workers must wait for the reduction phase to complete before proceeding to the next step. ML accelerators like GPUs and TPUs have reduced computation times, but communication cost continues to increase with the growing size of DNNs. Even with weak scaling and Gustafson's law [3], distributed training does not scale linearly with the number of workers due to high synchronization overhead.
KW - deep learning
KW - distributed training
KW - machine learning
UR - https://www.scopus.com/pages/publications/85179619970
U2 - 10.1109/CLUSTERWorkshops61457.2023.00023
DO - 10.1109/CLUSTERWorkshops61457.2023.00023
M3 - Conference contribution
AN - SCOPUS:85179619970
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 56
EP - 57
BT - Proceedings - 2023 IEEE International Conference on Cluster Computing Workshops and Posters, CLUSTER Workshops 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Conference on Cluster Computing Workshops, CLUSTER Workshops 2023
Y2 - 31 October 2023 through 3 November 2023
ER -