TY - GEN
T1 - Assessing the impact of ABFT and checkpoint composite strategies
AU - Bosilca, George
AU - Bouteiller, Aurelien
AU - Herault, Thomas
AU - Robert, Yves
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/11/27
Y1 - 2014/11/27
N2 - Algorithm-Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger applications, these algorithms provide a temporal section of the execution when the data is protected by its own intrinsic properties, and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach that is currently used for these applications is checkpoint/restart. In this paper we propose a model to investigate the efficiency of a composite protocol that alternates between ABFT and checkpoint/restart for effective protection of an iterative application composed of ABFT-aware and ABFT-unaware sections. We validate this model using a simulator. The model and simulator show that this composite approach drastically increases the performance delivered by an execution platform, especially at scale, by providing means to rarefy the checkpoints while simultaneously decreasing the volume of data that needs to be checkpointed.
KW - ABFT
KW - Checkpoint
KW - Fault-tolerance
KW - High-performance computing
KW - Model
KW - Performance evaluation
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=84918801852&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2014.79
DO - 10.1109/IPDPSW.2014.79
M3 - Conference contribution
AN - SCOPUS:84918801852
T3 - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
SP - 679
EP - 688
BT - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
PB - IEEE Computer Society
T2 - 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
Y2 - 19 May 2014 through 23 May 2014
ER -