Lifting and Dropping VMs to Dynamically Transition between Time- and Space-sharing for Large-Scale HPC Systems

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

As HPC environments increasingly integrate with edge-based systems, system architectures will need to handle a broader class of workloads and scheduling requirements. One result of this shift will be the need to simultaneously support bulk-synchronous parallel (BSP) and on-demand, service-based applications on the same infrastructure. This in turn will require that future resource management approaches use both space-shared and time-shared resource scheduling strategies. In this work we introduce the concept of "VM-lifting" (and its inverse, "VM-dropping"), which allows an HPC workload to be switched dynamically between space-shared and time-shared scheduling regimes. Our work targets co-kernel-based HPC system software environments, in which multiple specialized OS kernels execute natively on dedicated physical resource partitions inside a single compute node. With VM-lifting, a native co-kernel can be migrated at runtime to and from locally hosted virtual machine environments in response to the changing scheduling requirements of the node. This allows an HPC node to be dynamically (re-)configured as either a time-shared Infrastructure-as-a-Service (IaaS) resource or a dedicated space-shared resource, based on current workload demands. We have implemented this approach in the context of the Hobbes Exascale System Software stack and have demonstrated that a node can be reconfigured with minimal impact on the running applications.

Original language: English
Title of host publication: HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 30-42
Number of pages: 13
ISBN (Electronic): 9781450391993
State: Published - Jun 27 2022
Externally published: Yes
Event: 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022 - Virtual, Online, United States
Duration: Jun 27 2022 to Jun 30 2022

Publication series

Name: HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 31st International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2022
Country/Territory: United States
City: Virtual, Online
Period: 06/27/22 to 06/30/22

Funding

This work was supported by the National Science Foundation under Grant No. 1718287. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • high performance computing
  • operating systems
  • virtualization
