Fine-grained bit-flip protection for relaxation methods

Hartwig Anzt, Jack Dongarra, Enrique S. Quintana-Ortí

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit-flips, rejects the affected component updates, and thereby turns the initially synchronized solver into an asynchronous iteration. Our experimental study with sparse incomplete factorizations from a collection of real-world applications, and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
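
The idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state the detection criterion, so the check below is an assumption made for illustration only: a component update is rejected when the candidate is non-finite, or when its change exceeds `growth` times the largest change accepted in the previous sweep. The names `flip_bit`, `protected_jacobi`, `growth`, `floor`, and `fault` are hypothetical, the matrix is dense and tiny, and nothing here reproduces the authors' GPU implementation.

```python
import math
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE 754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def protected_jacobi(A, b, sweeps=60, growth=10.0, floor=1e-12, fault=None):
    """Jacobi sweeps with a per-component plausibility check (assumed criterion).

    A is a dense list-of-lists matrix, b the right-hand side.
    fault is an optional (sweep, component, bit) triple that injects one
    bit flip into a freshly computed component.
    Returns (x, number_of_rejected_updates).
    """
    n = len(b)
    x = [0.0] * n
    prev_max_step = math.inf  # no history yet, so the first sweep is accepted
    rejected = 0
    for k in range(sweeps):
        x_new = list(x)
        accepted_steps = []
        limit = growth * max(prev_max_step, floor)
        for i in range(n):
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            cand = (b[i] - sigma) / A[i][i]
            if fault is not None and (k, i) == (fault[0], fault[1]):
                cand = flip_bit(cand, fault[2])  # simulated soft error
            step = abs(cand - x[i])
            if math.isfinite(cand) and step <= limit:
                x_new[i] = cand
                accepted_steps.append(step)
            else:
                # Reject: component i keeps its stale value, so the sweep
                # is no longer synchronized across all components.
                rejected += 1
        if accepted_steps:
            prev_max_step = max(accepted_steps)
        x = x_new
    return x, rejected


# Strictly diagonally dominant test system with exact solution (1, 2, 3).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

x_clean, rej_clean = protected_jacobi(A, b)
x_fault, rej_fault = protected_jacobi(A, b, fault=(5, 1, 62))
```

In this sketch a flip in a high exponent bit produces a non-finite or implausibly large candidate, which is rejected, and the iteration proceeds with the stale value for that component. Flips in low mantissa bits pass the check, but they only perturb the iterate slightly, and a convergent Jacobi iteration damps such perturbations in later sweeps.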

Original language: English
Article number: 100583
Journal: Journal of Computational Science
Volume: 36
DOIs
State: Published - Sep 2019

Funding

This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Ortí was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.

Funders (funder number)
U.S. Department of Energy: DE-SC-0010042
NVIDIA
Ministerio de Economía y Competitividad: CICYT TIN2014-53495-R
European Regional Development Fund: CICYT TIN2014-53495-R

    Keywords

    • Bit flips
    • Fault tolerance
    • High performance computing
    • Iterative solvers
    • Jacobi method
    • Sparse linear systems
