TY - GEN
T1 - GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training
T2 - 16th IEEE International Conference on Cloud Computing, CLOUD 2023
AU - Tyagi, Sahil
AU - Swany, Martin
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques have been proposed to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open problem, since it varies with the quality of compression, model size and structure, hardware, network topology, and bandwidth. We propose GraVAC, a framework that dynamically adjusts the compression factor throughout training by evaluating model progress and assessing the gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16, and LSTM by 4.32×, 1.95×, and 6.67×, respectively. Compared to other adaptive schemes, our framework provides 1.94× to 5.63× overall speedup.
KW - adaptive systems
KW - data-parallel training
KW - deep learning
KW - gradient compression
KW - sparsification
UR - https://www.scopus.com/pages/publications/85166516312
U2 - 10.1109/CLOUD60044.2023.00045
DO - 10.1109/CLOUD60044.2023.00045
M3 - Conference contribution
AN - SCOPUS:85166516312
T3 - IEEE International Conference on Cloud Computing, CLOUD
SP - 319
EP - 329
BT - Proceedings - 2023 IEEE 16th International Conference on Cloud Computing, CLOUD 2023
A2 - Ardagna, Claudio
A2 - Atukorala, Nimanthi
A2 - Beckman, Pete
A2 - Chang, Carl K.
A2 - Chang, Rong N.
A2 - Evangelinos, Constantinos
A2 - Fan, Jing
A2 - Fox, Geoffrey C.
A2 - Fox, Judy
A2 - Hagleitner, Christoph
A2 - Jin, Zhi
A2 - Kosar, Tevfik
A2 - Parashar, Manish
PB - IEEE Computer Society
Y2 - 2 July 2023 through 8 July 2023
ER -