Highly scalable self-healing algorithms for high performance scientific computing

Zizhong Chen, Jack Dongarra

Research output: Contribution to journal › Article › peer-review

34 Scopus citations

Abstract

As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉ · k((β + 2γ)m + α) to (1 + O(√p/√m)) · 2 · k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to (1 + O(1/√m)) · k(β + 2γ)m, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that our self-healing scheme can survive multiple simultaneous process failures with low performance overhead and little numerical impact.
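The weighted checksum idea in the abstract can be illustrated with a small serial sketch. It is not the paper's implementation: there is no FT-MPI, no pipelined encoding, and the Vandermonde-style weights, the function names (`encode`, `solve`, `recover`), and the toy data are all assumptions chosen for illustration. Each of p processes holds a local checkpoint of m floating-point numbers, k weighted checksums are stored, and up to k lost checkpoints are rebuilt by solving a small linear system per element.

```python
def encode(data, weights):
    """Return one weighted checksum vector per row of weights."""
    m = len(data[0])
    return [[sum(w * x[e] for w, x in zip(row, data)) for e in range(m)]
            for row in weights]


def solve(a, b):
    """Solve a small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(a)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def recover(data, checksums, weights, failed):
    """Rebuild the checkpoints of the failed processes.

    data[i] is ignored for i in failed; the first len(failed) checksums
    are assumed to have survived.
    """
    f = len(failed)
    m = len(checksums[0])
    alive = [i for i in range(len(data)) if i not in failed]
    a = [[weights[j][i] for i in failed] for j in range(f)]
    rebuilt = {i: [0.0] * m for i in failed}
    for e in range(m):
        # Subtract the surviving contributions from each checksum.
        rhs = [checksums[j][e] - sum(weights[j][i] * data[i][e] for i in alive)
               for j in range(f)]
        for i, value in zip(failed, solve(a, rhs)):
            rebuilt[i][e] = value
    return rebuilt


# Toy setup (assumed values): p = 4 processes, m = 3 numbers each, k = 2 checksums.
p, k = 4, 2
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
# Assumed Vandermonde-style weights: row 0 is a plain sum, row 1 uses 1, 2, 3, 4.
weights = [[float(i + 1) ** j for i in range(p)] for j in range(k)]
checksums = encode(data, weights)

failed = [1, 3]
survivors = [None if i in failed else row for i, row in enumerate(data)]
rebuilt = recover(survivors, checksums, weights, failed)
print(rebuilt)
```

Because the recovery solves a linear system in floating-point arithmetic, rebuilt values can differ from the originals by rounding error, which is presumably the kind of numerical impact the abstract refers to. The choice of weights affects how well conditioned that system is; the Vandermonde-style weights here are only convenient for a small example.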

Original language: English
Article number: 4799775
Pages (from-to): 1512-1524
Number of pages: 13
Journal: IEEE Transactions on Computers
Volume: 58
Issue number: 11
State: Published - 2009
Externally published: Yes

Keywords

  • Diskless checkpointing
  • Fault tolerance
  • High-performance computing
  • Message passing interface
  • Parallel and distributed systems
  • Pipeline
  • Self-healing
