Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Cloud computing platforms can provide the computational resources required for training large machine learning models such as deep neural networks. While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin up large clusters for training models, it can also lead to ballooning costs. The hundreds of virtual machine sizes provided by cloud platforms also make it extremely challenging to select the 'right' cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training are highly sensitive to the cluster configuration, and present a large and complex tradeoff space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both parallel and statistical efficiency must be considered when selecting optimum job configuration parameters such as the number of workers and the batch size. By combining conventional parallel scaling concepts with new insights into SGD noise, we develop models for estimating the time and cost on different cluster configurations. Using the repetitive nature of training and our performance models, our Scavenger cloud service can search for optimum cloud configurations in a black-box, online manner. Our approach reduces training times by 2× and costs by more than 50%. Our performance models are accurate to within 2%, and our search imposes only a 10% overhead compared to an ideal oracle-based approach.
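The abstract's core idea, that a good cluster configuration must balance parallel efficiency (per-step speedup from more workers) against statistical efficiency (more SGD steps needed at small batch sizes), can be sketched as a simple cost/time model. The following is an illustrative sketch, not the paper's actual model: the constants (`NOISE_SCALE`, `MIN_STEPS`, per-example and communication times, VM price) and function names are hypothetical stand-ins.

```python
import math

PRICE_PER_WORKER_HR = 0.50   # hypothetical $/VM-hour
NOISE_SCALE = 256.0          # hypothetical SGD gradient-noise scale
MIN_STEPS = 10_000           # hypothetical steps needed at very large batch

def steps_to_converge(batch_size: float) -> float:
    """Statistical efficiency: smaller batches see noisier gradients,
    so they need more SGD steps to reach the same loss."""
    return MIN_STEPS * (1.0 + NOISE_SCALE / batch_size)

def time_per_step(workers: int, batch_size: float,
                  t_example: float = 1e-3, t_comm: float = 0.05) -> float:
    """Parallel efficiency: per-worker compute shrinks with more workers,
    but synchronization (e.g., tree allreduce) grows with cluster size."""
    compute = (batch_size / workers) * t_example
    comm = t_comm * math.log2(workers) if workers > 1 else 0.0
    return compute + comm

def estimate(workers: int, batch_size: float):
    """Estimated total training time (s) and dollar cost for one config."""
    secs = steps_to_converge(batch_size) * time_per_step(workers, batch_size)
    dollars = secs / 3600.0 * workers * PRICE_PER_WORKER_HR
    return secs, dollars

def best_config(candidates, time_budget_s):
    """Pick the cheapest (workers, batch_size) pair meeting a time budget."""
    feasible = [(estimate(w, b)[1], w, b) for w, b in candidates
                if estimate(w, b)[0] <= time_budget_s]
    if not feasible:
        return None
    _, w, b = min(feasible)
    return (w, b)
```

In this toy model, adding workers lowers per-step time but raises hourly cost and communication overhead, while larger batches cut the number of steps but inflate per-step compute, which is the time/cost tradeoff space the paper's search navigates online.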

Original language: English
Title of host publication: Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
Editors: Yogesh Simmhan, Ilkay Altintas, Ana-Lucia Varbanescu, Pavan Balaji, Abhinandan S. Prasad, Lorenzo Carnevale
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 403-413
Number of pages: 11
ISBN (Electronic): 9798350301199
DOIs
State: Published - 2023
Externally published: Yes
Event: 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 - Bangalore, India
Duration: May 1, 2023 – May 4, 2023

Publication series

Name: Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023

Conference

Conference: 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
Country/Territory: India
City: Bangalore
Period: 05/01/23 – 05/04/23

Funding

This research was supported by the NSF grant OAC-2112606.
