Proactive process-level live migration in HPC environments

Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

110 Scopus citations

Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
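
The closing claim (checkpoints nearly halved at 70% proactive coverage) can be sanity-checked with a short calculation. The sketch below is illustrative only and rests on an assumption not stated in the abstract: that the checkpoint interval follows Young's first-order approximation, interval = sqrt(2 * checkpoint cost * MTBF), and that faults handled proactively no longer count toward the failure rate seen by the reactive scheme. The function names and the sample values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Ratio of checkpoints taken with proactive FT to checkpoints taken without it.

    If a fraction p of faults is handled proactively, only (1 - p) of them
    remain for the reactive scheme, so the effective MTBF grows by 1 / (1 - p).
    Under Young's formula the interval then grows by sqrt(1 / (1 - p)), and the
    number of checkpoints over a fixed run shrinks by sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to exercise the formula.
    cost_s, mtbf_s, p = 60.0, 24 * 3600.0, 0.7
    base = young_interval(cost_s, mtbf_s)
    improved = young_interval(cost_s, mtbf_s / (1.0 - p))
    print(f"interval without proactive FT: {base:.0f} s")
    print(f"interval with proactive FT:    {improved:.0f} s")
    print(f"checkpoint ratio:              {checkpoint_ratio(p):.3f}")
```

Under these assumptions the ratio at 70% coverage is sqrt(0.3), roughly 0.55, which is consistent with "nearly cutting the number of checkpoints in half". The ratio does not depend on the checkpoint cost or the baseline MTBF.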

Original language: English
Title of host publication: 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008
State: Published - 2008
Event: 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008 - Austin, TX, United States
Duration: Nov 15 2008 to Nov 21 2008
