TY - GEN
T1 - A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI
AU - Bland, Wesley
AU - Du, Peng
AU - Bouteiller, Aurelien
AU - Herault, Thomas
AU - Bosilca, George
AU - Dongarra, Jack
PY - 2012
Y1 - 2012
N2 - Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support of software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery, without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example.
AB - Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support of software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery, without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example.
UR - http://www.scopus.com/inward/record.url?scp=84867633485&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-32820-6_48
DO - 10.1007/978-3-642-32820-6_48
M3 - Conference contribution
AN - SCOPUS:84867633485
SN - 9783642328190
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 477
EP - 488
BT - Parallel Processing - 18th International Conference, Euro-Par 2012, Proceedings
T2 - 18th International Conference on Parallel Processing, Euro-Par 2012
Y2 - 27 August 2012 through 31 August 2012
ER -