TY - GEN
T1 - Multi-criteria checkpointing strategies
T2 - 19th International Conference on Parallel Processing, Euro-Par 2013
AU - Bouteiller, Aurelien
AU - Cappello, Franck
AU - Dongarra, Jack
AU - Guermouche, Amina
AU - Hérault, Thomas
AU - Robert, Yves
PY - 2013
Y1 - 2013
AB - Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while the platform achieves near-perfect efficiency.
UR - http://www.scopus.com/inward/record.url?scp=84883201136&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-40047-6_43
DO - 10.1007/978-3-642-40047-6_43
M3 - Conference contribution
AN - SCOPUS:84883201136
SN - 9783642400469
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 420
EP - 431
BT - Euro-Par 2013 Parallel Processing - 19th International Conference, Proceedings
Y2 - 26 August 2013 through 30 August 2013
ER -