
DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

  • Yoochan Kim
  • Kihyun Kim
  • Yonghyeon Cho
  • Jinwoo Kim
  • Awais Khan
  • Ki Dong Kang
  • Baik Song An
  • Myung Hoon Cha
  • Hong Yeon Kim
  • Youngjae Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

Distributed Deep Learning (DDL) relies on GPU-based clusters as the preferred infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to many users. Public cloud services, particularly Spot Virtual Machines (VMs), offer a cost-effective alternative, but their unpredictable availability poses a significant challenge to the crucial checkpointing process in DDL. To address this, we introduce DeepVM, a novel solution that recommends cost-effective cluster configurations by balancing the use of Spot and On-Demand VMs. DeepVM uses a four-stage process that analyzes instance performance using the FLOPP (FLoating-point Operations Per Price) metric, performs architecture-level analysis with linear programming, and identifies the optimal configuration for the user's specific needs. Extensive simulations and real-world deployments in the AWS environment demonstrate that DeepVM consistently outperforms other policies, reducing both training cost and overall makespan. By enabling cost-effective checkpointing with Spot VMs, DeepVM opens up DDL to a wider range of users and facilitates more efficient training of complex DNNs.
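To make the FLOPP idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes FLOPP can be read as peak floating-point throughput divided by hourly price; the abstract gives only the expanded name, so this reading is an assumption. The `Instance` type, the instance names, and all throughput and price figures are hypothetical placeholders, not AWS data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A candidate VM type (all fields are illustrative placeholders)."""
    name: str
    peak_tflops: float       # assumed peak floating-point throughput
    hourly_price_usd: float  # assumed hourly price (Spot or On-Demand)


def flopp(inst: Instance) -> float:
    """Assumed FLOPP reading: throughput per unit price (TFLOPS per USD/hour)."""
    if inst.hourly_price_usd <= 0:
        raise ValueError("price must be positive")
    return inst.peak_tflops / inst.hourly_price_usd


def rank_by_flopp(instances: list[Instance]) -> list[Instance]:
    """Order candidates from most to least throughput per dollar."""
    return sorted(instances, key=flopp, reverse=True)


# Hypothetical candidates; the numbers are made up for illustration only.
candidates = [
    Instance("gpu-small-spot", 8.0, 0.20),
    Instance("gpu-small-ondemand", 8.0, 0.50),
    Instance("gpu-large-spot", 30.0, 1.00),
]

ranked = rank_by_flopp(candidates)
for inst in ranked:
    print(f"{inst.name}: {flopp(inst):.1f} TFLOPS per USD/hour")
```

A ranking like this covers only the instance-level analysis. The abstract's later stages (architecture-level analysis with linear programming and selection of the final Spot and On-Demand mix) are not modeled here.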

Original language: English
Title of host publication: Proceedings - 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 227-235
Number of pages: 9
ISBN (Electronic): 9798350395662
State: Published - 2024
Event: 24th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2024 - Philadelphia, United States
Duration: May 6 2024 to May 9 2024

Funding

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00498, Development of high-efficiency AI computing SW core technology for high-speed processing of large learning models). This work was also supported by, and used the resources of, the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at ORNL, which is managed by UT-Battelle, LLC for the U.S. DOE (under contract No. DE-AC05-00OR22725).

Keywords

  • Checkpoint-Restart
  • Cloud Computing
  • Distributed Deep Learning
