Scavenger: A Cloud Service for Optimizing Cost and Performance of DL Training

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

4 Scopus citations

Abstract

Deep learning (DL) models learn non-linear functions and relationships by iteratively training on given data. To accelerate training, data-parallel training [1] launches multiple instances of the training process on separate partitions of the data and periodically aggregates model updates. With the availability of VMs in the cloud, choosing the 'right' cluster configuration for data-parallel training presents non-trivial challenges. We tackle this problem by considering both the parallel and statistical efficiency of distributed training with respect to the cluster size and the training batch size. We build performance models to evaluate the Pareto relationship between the cost and time of DL training across different cluster and batch-size configurations, and develop Scavenger as a cloud service that searches for optimal cloud configurations in an online, black-box manner.
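
To make the cost/time Pareto relationship described in the abstract concrete, the following is a minimal illustrative sketch, not Scavenger's actual method. The performance model, all constants, and the names `Config`, `estimate`, and `pareto_front` are hypothetical assumptions introduced here; the exhaustive enumeration is only a stand-in for the online, black-box search the abstract mentions.

```python
from itertools import product
from typing import List, NamedTuple

# All constants below are hypothetical placeholders, not values from the paper.
SAMPLES_PER_WORKER_HOUR = 1_000_000.0  # assumed single-worker throughput
COMM_OVERHEAD = 0.05                   # assumed per-extra-worker sync overhead
BASE_SAMPLES = 10_000_000.0            # assumed samples to converge at tiny batches
CRITICAL_BATCH = 1024.0                # assumed batch size where convergence slows
PRICE_PER_WORKER_HOUR = 3.0            # assumed VM price in USD


class Config(NamedTuple):
    workers: int
    batch_size: int   # per-worker batch size
    time_h: float     # estimated training time in hours
    cost_usd: float   # estimated training cost in USD


def estimate(workers: int, batch_size: int) -> Config:
    """Toy model combining parallel and statistical efficiency (assumed forms)."""
    global_batch = workers * batch_size
    # Parallel efficiency: throughput scales sub-linearly with worker count.
    throughput = workers * SAMPLES_PER_WORKER_HOUR / (1.0 + COMM_OVERHEAD * (workers - 1))
    # Statistical efficiency: larger global batches need more samples to converge.
    samples_needed = BASE_SAMPLES * (1.0 + global_batch / CRITICAL_BATCH)
    time_h = samples_needed / throughput
    cost_usd = time_h * workers * PRICE_PER_WORKER_HOUR
    return Config(workers, batch_size, time_h, cost_usd)


def pareto_front(points: List[Config]) -> List[Config]:
    """Keep configurations not dominated in (time, cost); lower is better for both."""
    front = []
    for p in points:
        dominated = any(
            q.time_h <= p.time_h
            and q.cost_usd <= p.cost_usd
            and (q.time_h < p.time_h or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda c: c.time_h)


# Enumerate a small hypothetical grid of cluster sizes and per-worker batch sizes.
candidates = [estimate(w, b) for w, b in product([1, 2, 4, 8, 16], [32, 64, 128])]
front = pareto_front(candidates)
for c in front:
    print(f"workers={c.workers:2d} batch={c.batch_size:3d} "
          f"time={c.time_h:6.2f} h cost=${c.cost_usd:7.2f}")
```

Each configuration on the printed front trades time against cost: no other candidate in the grid is at least as fast and at least as cheap. A user would pick a point from this front according to a budget or deadline.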

Original language: English
Title of host publication: Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023
Editors: Yogesh Simmhan, Ilkay Altintas, Ana-Lucia Varbanescu, Pavan Balaji, Abhinandan S. Prasad, Lorenzo Carnevale
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 349-350
Number of pages: 2
ISBN (Electronic): 9798350302080
DOIs
State: Published - 2023
Event: 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023 - Bangalore, India
Duration: May 1 2023 to May 4 2023

Publication series

Name: Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023

Conference

Conference: 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing Workshops, CCGridW 2023
Country/Territory: India
City: Bangalore
Period: 05/1/23 to 05/4/23

Keywords

  • data parallel training
  • deep learning
  • distributed training
  • machine learning
  • performance modeling
