TY - GEN
T1 - Tuning stationary iterative solvers for fault resilience
AU - Anzt, Hartwig
AU - Dongarra, Jack
AU - Quintana-Ortí, Enrique S.
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/11/15
Y1 - 2015/11/15
AB - As transistor feature sizes decrease following Moore's Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error-resilient iterative solvers for sparse linear systems. In contrast to most previous approaches, which are based on Krylov subspace methods, we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
KW - Fault tolerance
KW - High performance computing
KW - Resilience
KW - Sparse linear systems
KW - Stationary (and asynchronous) iterative solvers
UR - http://www.scopus.com/inward/record.url?scp=84968585953&partnerID=8YFLogxK
U2 - 10.1145/2832080.2832081
DO - 10.1145/2832080.2832081
M3 - Conference contribution
AN - SCOPUS:84968585953
T3 - Proceedings of ScalA 2015: 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of ScalA 2015
PB - Association for Computing Machinery, Inc
T2 - 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2015
Y2 - 15 November 2015 through 20 November 2015
ER -