TY - GEN
T1 - A case for virtual machine based fault injection in a high-performance computing environment
AU - Naughton, Thomas
AU - Vallée, Geoffroy
AU - Engelmann, Christian
AU - Scott, Stephen L.
PY - 2012
Y1 - 2012
N2 - Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, the occurrence of failure will be much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques. While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chains for these systems are often tailored for the platform, and the operating environments typically contain many site/machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption. The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools for investigation into the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.
AB - Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, the occurrence of failure will be much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques. While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chains for these systems are often tailored for the platform, and the operating environments typically contain many site/machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption. The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools for investigation into the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.
UR - http://www.scopus.com/inward/record.url?scp=84882570755&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-29737-3_27
DO - 10.1007/978-3-642-29737-3_27
M3 - Conference contribution
AN - SCOPUS:84882570755
SN - 9783642297366
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 234
EP - 243
BT - Euro-Par 2011
PB - Springer Verlag
T2 - 17th Parallel Processing Workshops, Euro-Par 2011: CCPI 2011, CGWS 2011, HeteroPar 2011, HiBB 2011, HPCVirt 2011, HPPC 2011, HPSS 2011, MDGS 2011, ProPer 2011, Resilience 2011, UCHPC 2011, VHPC 2011
Y2 - 29 August 2011 through 2 September 2011
ER -