TY - GEN
T1 - High performance dense linear system solver with soft error resilience
AU - Du, Peng
AU - Luszczek, Piotr
AU - Dongarra, Jack
PY - 2011
Y1 - 2011
AB - As the scale of modern high-end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used scheme of checkpoint and restart (C/R). While hard errors have been studied extensively over the years, soft errors remain under-studied, especially for modern HPC systems, and in some scientific applications C/R is not applicable to soft errors at all because of error propagation and lack of error awareness. In this work, we propose an algorithm-based fault tolerance (ABFT) scheme for a high-performance dense linear system solver with soft error resilience. By adapting a mathematical model that treats a soft error during LU factorization as a rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contributions include extending the error model from Gaussian elimination and pairwise pivoting to LU with partial pivoting, a practical numerical bound for error detection, and a scalable checkpointing algorithm that protects the left factor needed to recover x from a soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to the linear system solve and scales well on such systems.
UR - http://www.scopus.com/inward/record.url?scp=80955123431&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2011.38
DO - 10.1109/CLUSTER.2011.38
M3 - Conference contribution
AN - SCOPUS:80955123431
SN - 9780769545165
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 272
EP - 280
BT - Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011
T2 - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011
Y2 - 26 September 2011 through 30 September 2011
ER -