Algorithm-based fault tolerance for fail-stop failures

Zizhong Chen, Jack Dongarra

Research output: Contribution to journalArticlepeer-review

102 Scopus citations

Abstract

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix-matrix multiplication kennel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no mater which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation or not remains open. In this paper, we first demonstrate that, for many matrix matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, however, for the outer product version algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging.

Original languageEnglish
Pages (from-to)1628-1641
Number of pages14
JournalIEEE Transactions on Parallel and Distributed Systems
Volume19
Issue number12
DOIs
StatePublished - 2008
Externally publishedYes

Funding

The authors would like to thank the anonymous reviewers for their valuable comments. This research was supported in part by the Los Alamos National Laboratory under Contract 03891-001-99 49 and the Applied Mathematical Sciences Research Program of the Office of Mathematical, Information, and Computational Sciences, US Department of Energy under Contract DE-AC05-00OR22725 with UT-Battelle, LLC.

FundersFunder number
U.S. Department of EnergyDE-AC05-00OR22725
Los Alamos National Laboratory03891-001-99 49

    Keywords

    • Algorithm-based fault tolerance
    • Checkpointing
    • Fail-stop failures
    • Parallel matrix-matrix multiplication
    • ScaLAPACK

    Fingerprint

    Dive into the research topics of 'Algorithm-based fault tolerance for fail-stop failures'. Together they form a unique fingerprint.

    Cite this