Algorithm-based diskless checkpointing for fault tolerant matrix operation

James S. Plank, Youngbae Kim, Jack J. Dongarra

Research output: Contribution to journal, Conference article, peer-review

39 Scopus citations

Abstract

This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the 'Network of Workstations' (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and Preconditioned Conjugate Gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault tolerance, and present performance results on a PVM network of SUN workstations and on the IBM SP2.
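
The "any N processors remain functional" property can be illustrated with a minimal sketch. The abstract does not say how checkpoints are encoded, so the sketch assumes one extra processor that holds a bytewise XOR parity of the N compute processors' in-memory states. The function names and sample data are hypothetical, and this is not presented as the authors' implementation.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(states: List[bytes]) -> bytes:
    """Encode a checkpoint: XOR of every compute processor's state.

    Assumption: a single parity processor keeps this buffer in memory,
    so no disk is involved.
    """
    return reduce(xor_bytes, states)


def recover(survivors: List[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost state from the surviving states and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical data: N = 3 compute processors plus one parity processor.
states = [b"blockA..", b"blockB..", b"blockC.."]
parity = make_parity(states)

lost = 1  # pretend compute processor 1 fails
survivors = [s for i, s in enumerate(states) if i != lost]
rebuilt = recover(survivors, parity)
assert rebuilt == states[lost]
```

With N + 1 buffers held in memory across the network, the loss of any single one leaves N from which the missing state can be rebuilt. This single-parity scheme tolerates only one failure between checkpoints.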

Original language: English
Pages (from-to): 351-360
Number of pages: 10
Journal: Proceedings - Annual International Conference on Fault-Tolerant Computing
State: Published - 1995
Externally published: Yes
Event: Proceedings of the 25th International Symposium on Fault-Tolerant Computing - Pasadena, CA, USA
Duration: Jun 27 1995 to Jun 30 1995
