TY - GEN
T1 - Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources
AU - Chen, Zizhong
AU - Dongarra, Jack
PY - 2006
Y1 - 2006
N2 - As the size of today's high performance computers increases from hundreds to thousands, and even tens of thousands, of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldom used in today's high performance matrix computations, and they have mostly focused on platforms where failed processors produce incorrect calculations. To fill this gap, this paper extends existing algorithm-based fault tolerance to volatile computing platforms, where a failed processor stops working, and applies it to scalable high performance matrix computations with a two-dimensional block-cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.
AB - As the size of today's high performance computers increases from hundreds to thousands, and even tens of thousands, of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldom used in today's high performance matrix computations, and they have mostly focused on platforms where failed processors produce incorrect calculations. To fill this gap, this paper extends existing algorithm-based fault tolerance to volatile computing platforms, where a failed processor stops working, and applies it to scalable high performance matrix computations with a two-dimensional block-cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.
UR - http://www.scopus.com/inward/record.url?scp=33847240498&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2006.1639333
DO - 10.1109/IPDPS.2006.1639333
M3 - Conference contribution
AN - SCOPUS:33847240498
SN - 1424400546
SN - 9781424400546
T3 - 20th International Parallel and Distributed Processing Symposium, IPDPS 2006
BT - 20th International Parallel and Distributed Processing Symposium, IPDPS 2006
PB - IEEE Computer Society
T2 - 20th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2006
Y2 - 25 April 2006 through 29 April 2006
ER -
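
Editor's note: the record above concerns algorithm-based fault tolerance for matrix-matrix multiplication under fail-stop failures. The following is a minimal NumPy sketch of the classical checksum encoding that this line of work builds on; it is an illustration only, not the paper's ScaLAPACK/PBLAS implementation, and the single lost row of C below merely stands in for the data held by one failed process in a two-dimensional block-cyclic grid. All variable names are assumptions made for this sketch.

# Sketch, assuming dense square matrices and one lost row of C.
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Column-checksum A: append a row holding the column sums of A.
Ac = np.vstack([A, A.sum(axis=0)])
# Row-checksum B: append a column holding the row sums of B.
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])

# The product of the encoded matrices is the full-checksum matrix of C = A @ B:
# its last row holds the column sums of C, its last column the row sums of C.
Cf = Ac @ Br

# Simulate a fail-stop loss of one row of C (the data of one failed process).
lost = 2
C_damaged = Cf[:n, :n].copy()
C_damaged[lost, :] = np.nan

# Recover the lost row from the checksum row: each column sum of C is known,
# so each missing entry is that checksum minus the surviving entries.
surviving = np.delete(C_damaged, lost, axis=0).sum(axis=0)
recovered_row = Cf[n, :n] - surviving

assert np.allclose(recovered_row, (A @ B)[lost, :])

The point of the sketch is that the checksum relationship is maintained by the multiplication itself, so no checkpoint of C is ever taken; recovery uses only the encoded redundancy, which is the checkpoint-free property the cited paper extends to fail-stop failures on a process grid.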