Numerical analysis of fixed point algorithms in the presence of hardware faults

Miroslav Stoyanov, Clayton Webster

Research output: Contribution to journal › Article › peer-review

14 Scopus citations

Abstract

The exponential growth in the computational power of extreme-scale machines over the past few decades has led to a corresponding decrease in reliability and a sharp increase in the frequency of hardware faults. Our research focuses on the mathematical challenges presented by silent hardware faults, i.e., faults that can perturb the result of a computation in an inconspicuous way. Using the approach of selective reliability, we present an analytic fault model that can be used to study the resilience properties of a numerical algorithm. We apply our approach to the classical fixed point iteration and demonstrate that, in the presence of hardware faults, the classical method fails to converge in expectation. We present a modified resilient algorithm that detects and rejects faults that result in errors of large magnitude, while small faults are negated by the natural self-correcting properties of the algorithm. We show that our method is convergent (in the first and second statistical moments) even in the presence of silent hardware faults.
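
The detect-and-reject idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm or fault model: the fixed rejection threshold, the deterministic fault schedule, and the names make_faulty and resilient_fixed_point are all assumptions made for illustration. The map is cos(x), a contraction on [0, 1], and the check on the update size is assumed to run reliably, in the spirit of selective reliability.

```python
import math

def make_faulty(g, faults):
    """Hypothetical fault injector: call number k returns g(x) + faults[k]."""
    calls = 0
    def wrapped(x):
        nonlocal calls
        calls += 1
        return g(x) + faults.get(calls, 0.0)  # silent perturbation on chosen calls
    return wrapped

def resilient_fixed_point(g_unreliable, x0, threshold, iters):
    """Iterate x <- g(x), rejecting any update that moves x by more than threshold."""
    x, rejected = x0, 0
    for _ in range(iters):
        candidate = g_unreliable(x)       # may be silently corrupted
        if abs(candidate - x) > threshold:
            rejected += 1                 # large jump: treat as a fault, keep x, retry
            continue
        x = candidate                     # small faults pass; the contraction damps them
    return x, rejected

# Assumed schedule: two large faults (calls 5 and 12) and one small fault (call 20).
g = make_faulty(math.cos, {5: 1.0e6, 12: -3.0e2, 20: 1.0e-3})
x, rejected = resilient_fixed_point(g, x0=1.0, threshold=0.5, iters=100)
print(x, rejected)
```

In this sketch the two large perturbations exceed the threshold and are discarded, while the small one is accepted and then contracted away by the later clean iterations. The threshold of 0.5 is chosen only so that the first legitimate step from x0 = 1.0 is accepted; choosing it in general depends on the contraction constant and is beyond what this sketch attempts.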

Original language: English
Pages (from-to): C532-C553
Journal: SIAM Journal on Scientific Computing
Volume: 37
Issue number: 5
State: Published - 2015

Funding

Funders
Office of Science
U.S. Department of Energy

    Keywords

    • Fault tolerance
    • Fixed point method
    • Resilience
