Abstract
Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips and rejects some component updates, turning the initially synchronized solver into an asynchronous iteration. Our experimental study, using sparse incomplete factorizations from a collection of real-world applications and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
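The idea of rejecting suspicious component updates in a Jacobi iteration can be illustrated with a minimal Python sketch. The abstract does not specify the detection criterion, so the check below is an assumption made for illustration only: for a strictly diagonally dominant matrix, every fault-free Jacobi iterate started from zero stays within the bound c / (1 - rho), where rho is the infinity norm of the Jacobi iteration matrix and c is the infinity norm of D^{-1}b. Any update that is non-finite or exceeds this bound is discarded and the stale value is kept. The names `flip_bit`, `resilient_jacobi`, and the `faults` injection hook are hypothetical, and the dense list-of-lists storage stands in for a sparse format.

```python
import math
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE-754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def resilient_jacobi(A, b, iters=100, faults=None):
    """Jacobi sweeps with a per-component plausibility check.

    `faults` maps (iteration, row) -> bit index to flip in that update.
    It is a hypothetical fault-injection hook used only for demonstration.
    Returns the final iterate and the number of rejected updates.
    """
    n = len(b)
    faults = faults or {}
    # Infinity norm of the Jacobi iteration matrix D^{-1}(L + U).
    rho = max(
        sum(abs(A[i][j]) for j in range(n) if j != i) / abs(A[i][i])
        for i in range(n)
    )
    if rho >= 1.0:
        raise ValueError("this bound assumes strict diagonal dominance")
    c = max(abs(b[i] / A[i][i]) for i in range(n))
    # Assumed detector: fault-free iterates never exceed this magnitude.
    bound = c / (1.0 - rho) * (1.0 + 1e-12)  # small slack for rounding

    x = [0.0] * n
    rejected = 0
    for k in range(iters):
        x_new = list(x)
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            candidate = (b[i] - s) / A[i][i]
            if (k, i) in faults:
                candidate = flip_bit(candidate, faults[(k, i)])
            if math.isfinite(candidate) and abs(candidate) <= bound:
                x_new[i] = candidate
            else:
                # Reject the update: component i keeps its stale value,
                # so the sweep is no longer fully synchronized.
                rejected += 1
        x = x_new
    return x, rejected


if __name__ == "__main__":
    A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
    b = [2.0, 4.0, 10.0]
    x, rejected = resilient_jacobi(A, b, faults={(0, 0): 62, (3, 1): 62})
    print(x, rejected)
```

A magnitude check of this kind can only catch flips that push a component outside the plausible range, such as flips in high exponent bits. Flips in low-order mantissa bits pass the check and are instead damped by the contraction of the iteration itself, at the cost of extra sweeps.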
Original language | English |
---|---|
Article number | 100583 |
Journal | Journal of Computational Science |
Volume | 36 |
DOIs | |
State | Published - Sep 2019 |
Funding
This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Ortí was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.
Funders | Funder number |
---|---|
U.S. Department of Energy | DE-SC-0010042 |
NVIDIA | |
Ministerio de Economía y Competitividad | CICYT TIN2014-53495-R |
European Regional Development Fund | |
Keywords
- Bit flips
- Fault tolerance
- High performance computing
- Iterative solvers
- Jacobi method
- Sparse linear systems