TY - GEN
T1 - Proactive process-level live migration in HPC environments
AU - Wang, Chao
AU - Mueller, Frank
AU - Engelmann, Christian
AU - Scott, Stephen L.
PY - 2008
Y1 - 2008
N2 - As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
AB - As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
UR - http://www.scopus.com/inward/record.url?scp=70350755748&partnerID=8YFLogxK
U2 - 10.1109/SC.2008.5222634
DO - 10.1109/SC.2008.5222634
M3 - Conference contribution
AN - SCOPUS:70350755748
SN - 9781424428359
T3 - 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008
BT - 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008
T2 - 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008
Y2 - 15 November 2008 through 21 November 2008
ER -