Tuning stationary iterative solvers for fault resilience

Hartwig Anzt, Jack Dongarra, Enrique S. Quintana-Ortí

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

As the transistor's feature size decreases following Moore's Law, hardware will become more prone to permanent, intermittent, and transient errors, increasing the number of failures experienced by applications and diminishing the confidence of users. As a result, resilience is considered the most difficult under-addressed issue faced by the High Performance Computing community. In this paper, we address the design of error-resilient iterative solvers for sparse linear systems. In contrast to most previous approaches, which are based on Krylov subspace methods, we analyze stationary component-wise relaxation. Concretely, starting from a plain implementation of the Jacobi iteration, we design a low-cost component-wise technique that elegantly handles bit-flips, turning the initial synchronized solver into an asynchronous iteration. Our experimental study employs sparse incomplete factorizations from several practical applications to expose the convergence delay incurred by the fault-tolerant implementation.
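The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how corrupted components are detected, so the rejection rule below (discard an update that is non-finite or exceeds a magnitude `bound`) is an assumption, as are the function name `jacobi`, the `fault` injection tuple, and the dense list-of-lists matrix. A rejected component simply keeps its previous value, so it lags one sweep behind the others, which is the sense in which the synchronized Jacobi iteration starts to behave like an asynchronous one.

```python
import math

def jacobi(A, b, iters, fault=None, bound=None):
    """Jacobi sweeps x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii.

    fault: optional (sweep, index, bad_value) tuple that overwrites one
           component update (hypothetical fault model, for illustration).
    bound: if given, an update that is non-finite or larger than `bound`
           in magnitude is discarded (assumed detection rule).
    Returns the final iterate and the number of discarded updates.
    """
    n = len(b)
    x = [0.0] * n
    skipped = 0
    for k in range(iters):
        new_x = x[:]  # components not updated below keep their old value
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            xi = (b[i] - s) / A[i][i]
            if fault is not None and fault[0] == k and fault[1] == i:
                xi = fault[2]  # simulated result of a bit-flip
            if bound is not None and (not math.isfinite(xi) or abs(xi) > bound):
                skipped += 1  # reject: this component reuses stale data
                continue
            new_x[i] = xi
        x = new_x
    return x, skipped

# Small diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x, skipped = jacobi(A, b, 60, fault=(5, 1, float("inf")), bound=1e6)
```

In this sketch the check costs one comparison per component update, and the only effect of a rejected update is that convergence is delayed by the skipped component.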

Original language: English
Title of host publication: Proceedings of ScalA 2015
Subtitle of host publication: 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450340113
DOIs
State: Published - Nov 15 2015
Externally published: Yes
Event: 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2015 - Austin, United States
Duration: Nov 15 2015 to Nov 20 2015

Publication series

Name: Proceedings of ScalA 2015: 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2015
Country/Territory: United States
City: Austin
Period: 11/15/15 to 11/20/15

Funding

This work was partly funded by the U.S. Department of Energy (Award Number DE-SC-0010042) and the Russian Science Foundation (Agreement N14-11-00190). E. S. Quintana-Ortí was supported by projects TIN2011-23283 and TIN2014-53495-R of the Spanish Ministerio de Economía y Competitividad.

Funder: Funder number
U.S. Department of Energy: DE-SC-0010042
Ministerio de Economía y Competitividad: TIN2011-23283, TIN2014-53495-R
Russian Science Foundation: N14-11-00190

    Keywords

    • Fault tolerance
    • High performance computing
    • Resilience
    • Sparse linear systems
    • Stationary (and asynchronous) iterative solvers
