Fault tolerant matrix operations using checksum and reverse computation

  • Y. Kim
  • J. S. Plank
  • J. J. Dongarra

Research output: Contribution to conference, Paper, peer-review

13 Scopus citations

Abstract

In this paper, we present a technique, based on checksum and reverse computation, that makes high-performance matrix operations fault-tolerant with low overhead. We have implemented this technique for five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization, and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
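The abstract does not spell out the mechanics, so the following is only a minimal sketch under assumptions of our own, not the authors' implementation. It reads "checksum" as a single checkpoint block holding the element-wise sum of every processor's block of C, and "reverse computation" as surviving processors undoing the updates C_q += A_q[:, k] B[k, :] made since the checkpoint, rather than keeping a full copy of their checkpointed state. Processors are simulated in one Python process, each owning an equal-sized row block of A and C plus a copy of B; integer data keeps the subtractions exact. The names `fault_tolerant_matmul`, `rank1_update`, `checksum`, and `step_all` are hypothetical.

```python
"""Sketch: checksum checkpointing plus reverse computation for C = A @ B.

Assumptions (ours, not the paper's): P simulated processors each own an
equal-sized row block of A and C and a full copy of B; the checkpoint is a
single checksum block, the element-wise sum of all C blocks; integer data
keeps the subtractions exact.
"""
from typing import List

Matrix = List[List[int]]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0] * cols for _ in range(rows)]


def rank1_update(c: Matrix, a: Matrix, b: Matrix, k: int, sign: int = 1) -> None:
    """In place: C += sign * outer(A[:, k], B[k, :]); sign=-1 undoes step k."""
    for i, row in enumerate(c):
        for j in range(len(row)):
            row[j] += sign * a[i][k] * b[k][j]


def checksum(blocks: List[Matrix]) -> Matrix:
    """Element-wise sum of equally shaped blocks (the checksum 'processor')."""
    total = zeros(len(blocks[0]), len(blocks[0][0]))
    for blk in blocks:
        for i, row in enumerate(blk):
            for j, v in enumerate(row):
                total[i][j] += v
    return total


def step_all(c: List[Matrix], a_blocks: List[Matrix], b: Matrix, k: int) -> None:
    """Every processor applies forward step k to its own block."""
    for cq, aq in zip(c, a_blocks):
        rank1_update(cq, aq, b, k)


def fault_tolerant_matmul(a_blocks: List[Matrix], b: Matrix,
                          ckpt_step: int, fail_step: int, failed: int) -> List[Matrix]:
    """Compute the row blocks of A @ B, surviving one simulated failure."""
    assert 0 <= ckpt_step <= fail_step <= len(b)
    c = [zeros(len(aq), len(b[0])) for aq in a_blocks]
    for k in range(ckpt_step):             # work before the checkpoint
        step_all(c, a_blocks, b, k)
    ckpt = checksum(c)                     # one extra block, not a copy per processor
    for k in range(ckpt_step, fail_step):  # work after the checkpoint
        step_all(c, a_blocks, b, k)
    c[failed] = None                       # processor `failed` loses its C block
    survivors = [q for q in range(len(c)) if q != failed]
    # Reverse computation: survivors undo steps fail_step-1 down to ckpt_step.
    for k in reversed(range(ckpt_step, fail_step)):
        for q in survivors:
            rank1_update(c[q], a_blocks[q], b, k, sign=-1)
    # Recovery: lost block = checksum minus the rolled-back survivor blocks.
    lost = [row[:] for row in ckpt]
    for q in survivors:
        for i, row in enumerate(c[q]):
            for j, v in enumerate(row):
                lost[i][j] -= v
    c[failed] = lost
    # Resume everyone from the checkpoint (assumes A's blocks can be re-read).
    for k in range(ckpt_step, len(b)):
        step_all(c, a_blocks, b, k)
    return c


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 2]]
    B = [[1, 0], [2, 1], [0, 3]]
    print(fault_tolerant_matmul([A[0:2], A[2:4]], B, ckpt_step=1, fail_step=3, failed=0))
    # [[[5, 11], [14, 23]], [[23, 35], [1, 6]]]
```

In this sketch, storing one checksum block instead of a per-processor copy is what makes reverse computation necessary: the survivors' memory holds post-checkpoint values, so they compute their way back to the checkpoint before the lost block can be solved for. With floating-point data, the subtractions would restore the checkpointed values only up to round-off.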

Original language: English
Pages: 70-77
Number of pages: 8
State: Published, 1996
Externally published: Yes
Event: Proceedings of the 1996 6th Symposium on the Frontiers of Massively Parallel Computing, Frontiers'96, Annapolis, MD, USA
Duration: Oct 27 1996 to Oct 31 1996

Conference

Conference: Proceedings of the 1996 6th Symposium on the Frontiers of Massively Parallel Computing, Frontiers'96
City: Annapolis, MD, USA
Period: 10/27/96 to 10/31/96

