TY - GEN
T1 - CIFTS
T2 - 38th International Conference on Parallel Processing, ICPP-2009
AU - Gupta, R.
AU - Beckman, P.
AU - Park, B. H.
AU - Lusk, E.
AU - Hargrove, P.
AU - Geist, A.
AU - Panda, D. K.
AU - Lumsdaine, A.
AU - Dongarra, J.
PY - 2009
Y1 - 2009
N2 - Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently, and information about faults is rarely shared. This lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation, we demonstrate the nonintrusive, low-overhead capability of CIFTS that lets applications run with minimal performance degradation.
UR - http://www.scopus.com/inward/record.url?scp=77951481809&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2009.20
DO - 10.1109/ICPP.2009.20
M3 - Conference contribution
AN - SCOPUS:77951481809
SN - 9780769538020
T3 - Proceedings of the International Conference on Parallel Processing
SP - 237
EP - 245
BT - ICPP-2009 - The 38th International Conference on Parallel Processing
Y2 - 22 September 2009 through 25 September 2009
ER -