Abstract
Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, fault tolerance is incorporated into the matrix operations, making them resilient to any single process failure with low overhead. In this paper, we present a technique called multiple checkpointing that enables the matrix operations to tolerate a certain set of multiple processor failures by adding multiple checkpointing processors. Results of implementing this technique on a network of workstations show improvement in both the reliability of the computation and the performance of checkpointing.
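The abstract does not spell out how the checkpoints are encoded, so the following is only a minimal sketch of one way diskless checkpointing with several checkpointing processors could work. It assumes the application processes are split into fixed-size groups and that each group's checkpointing processor stores the element-wise sum of that group's matrix blocks. The function names `make_checkpoints` and `recover`, the `group_size` parameter, and the sum encoding are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_checkpoints(blocks, group_size):
    """Return one checksum block per group of consecutive process blocks.

    Assumption: each checkpointing processor holds the element-wise sum
    of the blocks owned by the processes in its group.
    """
    return [
        np.sum(blocks[start:start + group_size], axis=0)
        for start in range(0, len(blocks), group_size)
    ]

def recover(blocks, checkpoints, group_size, failed):
    """Rebuild the block of process `failed` from its group's checksum.

    Only the surviving blocks of the same group are read, so the lost
    block itself is never accessed.
    """
    group = failed // group_size
    start = group * group_size
    end = min(start + group_size, len(blocks))
    survivors = [blocks[i] for i in range(start, end) if i != failed]
    return checkpoints[group] - np.sum(survivors, axis=0)

# Four processes, two groups, hence two checkpointing processors.
blocks = [np.full((2, 2), i + 1) for i in range(4)]
checkpoints = make_checkpoints(blocks, group_size=2)

# One failure in each group can be repaired independently.
restored_1 = recover(blocks, checkpoints, 2, failed=1)
restored_3 = recover(blocks, checkpoints, 2, failed=3)
```

Under this assumed scheme, the tolerated set of multiple failures is any set with at most one failed process per group, and each checkpointing processor only communicates with its own group.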
Original language | English |
---|---|
Pages | 460-465 |
Number of pages | 6 |
State | Published - 1997 |
Externally published | Yes |
Event | Proceedings of the 1997 2nd High Performance Computing on the Information Superhighway, HPC Asia'97 - Seoul, South Korea. Duration: Apr 28 1997 → May 2 1997 |
Conference
Conference | Proceedings of the 1997 2nd High Performance Computing on the Information Superhighway, HPC Asia'97 |
---|---|
City | Seoul, South Korea |
Period | 04/28/97 → 05/02/97 |