Fault tolerant matrix operations for networks of workstations using multiple checkpointing

Youngbae Kim, James S. Plank, Jack J. Dongarra

Research output: Contribution to conference › Paper › peer-review

7 Scopus citations

Abstract

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, fault tolerance is incorporated into the matrix operations, making them resilient to any single process failure with low overhead. In this paper, we present a technique called multiple checkpointing that enables the matrix operations to tolerate a certain set of multiple processor failures by adding multiple checkpointing processors. Results of implementing this technique on a network of workstations show improvement in both the reliability of the computation and the performance of checkpointing.
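The abstract does not spell out the encoding or how checkpointing processors are assigned, so the following is only a minimal sketch of the general idea under stated assumptions: compute processes are split into fixed-size groups, each group has one hypothetical checkpointing process holding the element-wise XOR parity of the group's in-memory data, and recovery succeeds when no group loses more than one member. The function names (`parity`, `checkpoint`, `recover`), the grouping rule, and the use of XOR on integers are illustrative choices, not the authors' implementation.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """Element-wise XOR of equally sized integer blocks."""
    return [reduce(xor, column) for column in zip(*blocks)]


def checkpoint(blocks, group_size):
    """Partition process ids into groups and compute one parity block per group.

    Assumption: each parity block lives in the memory of a separate,
    hypothetical checkpointing process (no disk is involved).
    """
    groups = [
        list(range(start, min(start + group_size, len(blocks))))
        for start in range(0, len(blocks), group_size)
    ]
    parities = [parity([blocks[p] for p in group]) for group in groups]
    return groups, parities


def recover(blocks, groups, parities, failed):
    """Rebuild the data of failed processes from survivors and group parity.

    Works only for failure sets with at most one failed process per group;
    otherwise a single parity block cannot disambiguate the lost data.
    """
    restored = {}
    for group, group_parity in zip(groups, parities):
        lost = [p for p in group if p in failed]
        if len(lost) > 1:
            raise ValueError(f"group {group} lost more than one process")
        if lost:
            survivors = [blocks[p] for p in group if p not in failed]
            # XOR of the survivors and the parity yields the missing block.
            restored[lost[0]] = parity(survivors + [group_parity])
    return restored


# Four compute processes, two per group, so two checkpointing processes.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
groups, parities = checkpoint(data, group_size=2)

# Processes 0 and 3 fail at the same time, one in each group.
restored = recover(data, groups, parities, failed={0, 3})
```

In this sketch, adding checkpointing processes shrinks the groups, which both widens the set of simultaneous failures that can be survived and reduces the amount of data each checkpointing process has to combine. Two failures inside the same group (for example processes 0 and 1) are not recoverable and raise `ValueError`.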

Original language: English
Pages: 460-465
Number of pages: 6
State: Published - 1997
Externally published: Yes
Event: Proceedings of the 1997 2nd High Performance Computing on the Information Superhighway, HPC Asia'97 - Seoul, South Korea
Duration: Apr 28 1997 to May 2 1997
