GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

Sahil Tyagi, Martin Swany

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques have been proposed to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open problem, since it varies with the quality of compression, model size and structure, hardware, network topology and bandwidth. We propose GraVAC, a framework to dynamically adjust the compression factor throughout training by evaluating model progress and assessing the gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32×, 1.95× and 6.67× respectively. Compared to other adaptive schemes, our framework provides 1.94× to 5.63× overall speedup.
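The abstract describes adapting the compression factor based on the information loss that compression introduces. GraVAC's exact adaptation policy is defined in the paper itself; the sketch below only illustrates the general idea with top-k sparsification, using squared-norm retention as a stand-in for "gradient information". The function names, the retention threshold and the doubling/halving rule are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def topk_compress(grad, cf):
    """Keep the top 1/cf fraction of gradient elements by magnitude.
    cf is the compression factor (e.g., cf=100 keeps 1% of elements)."""
    flat = grad.ravel()
    k = max(1, int(flat.size / cf))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

def compression_gain(grad, compressed):
    """Fraction of gradient 'information' (squared L2 norm) retained
    after compression; 1.0 means lossless."""
    denom = np.sum(grad ** 2)
    return float(np.sum(compressed ** 2) / denom) if denom > 0 else 1.0

def adjust_cf(cf, gain, threshold=0.9, cf_min=2, cf_max=1000):
    """Hypothetical feedback rule: compress harder while enough
    information is retained, back off when the loss grows too large."""
    if gain >= threshold:
        return min(cf * 2, cf_max)   # retained enough: compress more
    return max(cf / 2, cf_min)       # too lossy: relax compression
```

In a DDP loop, each iteration would compress the local gradient, measure the retained gain, and feed it back into `adjust_cf` before the next all-reduce, so the compression factor tracks how much useful signal the gradients currently carry.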

Original language: English
Title of host publication: Proceedings - 2023 IEEE 16th International Conference on Cloud Computing, CLOUD 2023
Editors: Claudio Ardagna, Nimanthi Atukorala, Pete Beckman, Carl K. Chang, Rong N. Chang, Constantinos Evangelinos, Jing Fan, Geoffrey C. Fox, Judy Fox, Christoph Hagleitner, Zhi Jin, Tevfik Kosar, Manish Parashar
Publisher: IEEE Computer Society
Pages: 319-329
Number of pages: 11
ISBN (Electronic): 9798350304817
DOIs
State: Published - 2023
Externally published: Yes
Event: 16th IEEE International Conference on Cloud Computing, CLOUD 2023 - Hybrid, Chicago, United States
Duration: Jul 2, 2023 – Jul 8, 2023

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2023-July
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 16th IEEE International Conference on Cloud Computing, CLOUD 2023
Country/Territory: United States
City: Hybrid, Chicago
Period: 7/2/23 – 7/8/23

Keywords

  • adaptive systems
  • data-parallel training
  • deep learning
  • gradient compression
  • sparsification
