Proactive process-level live migration and back migration in HPC environments

Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott

Research output: Contribution to journal (Article, peer-reviewed)

31 Scopus citations

Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1–6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13–24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit of back migration.
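The claim that handling 70% of faults proactively nearly halves the number of checkpoints can be reproduced with a short calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF); the abstract does not say which model underlies the figure, so the model choice, the function names, and the sample values are illustrative assumptions rather than the authors' implementation.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF).

    Assumption: this model is chosen for illustration only.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoint count with proactive FT divided by the count without it.

    If a fraction p of faults is avoided by migration, the reactive scheme
    only sees the remaining (1 - p), so its effective MTBF grows by
    1 / (1 - p). For a fixed run length the checkpoint count is inversely
    proportional to the interval, which gives a ratio of sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical sample values: 60 s per checkpoint, 6 h MTBF.
cost_s, mtbf_s = 60.0, 6 * 3600.0
baseline = young_interval(cost_s, mtbf_s)
with_proactive = young_interval(cost_s, mtbf_s / (1.0 - 0.70))

print(f"interval without proactive FT: {baseline:.0f} s")
print(f"interval with 70% proactive:   {with_proactive:.0f} s")
print(f"checkpoint count ratio:        {checkpoint_ratio(0.70):.3f}")
```

Under this model the ratio depends only on the proactive fraction, not on the checkpoint cost or the MTBF, and for a fraction of 0.70 it is sqrt(0.3), a little over one half.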

Original language: English
Pages (from-to): 254-267
Number of pages: 14
Journal: Journal of Parallel and Distributed Computing
Volume: 72
Issue number: 2
State: Published - Feb 2012

Funding

This work was supported in part by NSF grants CCR-0237570 (CAREER), CNS-0410203, CCF-0429653, CNS-1058779, DOE DE-FG02-08ER25837 and DOE DE-FG02-05ER25664. The research at ORNL was supported by Office of Advanced Scientific Computing Research and DOE DE-AC05-00OR22725 with UT-Battelle, LLC.

Frank Mueller is a Professor in Computer Science and a member of multiple research centers at North Carolina State University. Previously, he held positions at Lawrence Livermore National Laboratory and Humboldt University Berlin, Germany. He received his Ph.D. from Florida State University in 1994. He has published papers in the areas of parallel and distributed systems, embedded and real-time systems and compilers. He is a member of ACM SIGPLAN, ACM SIGBED and a senior member of the ACM and IEEE Computer Societies. He is a recipient of an NSF Career Award, an IBM Faculty Award, a Google Research Award and a Fellowship from the Humboldt Foundation.

Funders                                      Funder number
Google Research Award
National Science Foundation                  CCF-0429653, CCR-0237570, CNS-0410203, CNS-1058779
U.S. Department of Energy                    DE-FG02-05ER25664, DE-FG02-08ER25837
International Business Machines Corporation
Alexander von Humboldt-Stiftung
Advanced Scientific Computing Research       DE-AC05-00OR22725

    Keywords

    • Back migration
    • Fault tolerance
    • Health monitoring
    • High-performance computing
    • Live migration
