Fault-tolerant matrix operations for networks of workstations using diskless checkpointing

James S. Plank, Youngbae Kim, Jack J. Dongarra

Research output: Contribution to journalArticlepeer-review

37 Scopus citations

Abstract

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at leastnprocessors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched ethernet.

Original languageEnglish
Pages (from-to)125-138
Number of pages14
JournalJournal of Parallel and Distributed Computing
Volume43
Issue number2
DOIs
StatePublished - Jun 15 1997

Funding

James Plank was supported by National Science Foundation Grant CCR-9409496 and the ORAU Junior Faculty Enhancement Award. Jack Dongarra is supported by the Defense Advanced Research Projects Agency under Contract DAAL03-91-C-0047, administered by the Army Research Office, by the Office of Scientific Computing, U.S. Department of Energy, under Contract DE-AC05-84OR21400, and by the National Science Foundation Science and Technology Center Cooperative Agreement CCR-8809615.

Fingerprint

Dive into the research topics of 'Fault-tolerant matrix operations for networks of workstations using diskless checkpointing'. Together they form a unique fingerprint.

Cite this