Fault tolerant high performance computing by a coding approach

Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra

Research output: Contribution to conference > Paper > peer-review

84 Scopus citations

Abstract

As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, have to abort and restart from the beginning or from a stable-storage-based checkpoint. This paper explores the use of the floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been theoretically attempted before in diskless checkpointing, few actual implementations exist. This probably derives from concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce the simple but efficient floating-point arithmetic coding approach into diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
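
The idea of encoding checkpoints with ordinary floating-point arithmetic, rather than Galois-field arithmetic, can be illustrated with a small sketch. The code below is not the authors' implementation: it simulates four "processes" as plain Python lists instead of MPI ranks, and the helper names `encode` and `recover_two` and the weight rows are hypothetical choices made for illustration. Two weighted checksums are kept, and two lost blocks are rebuilt by solving a 2x2 linear system per element, which is also where round-off error can enter.

```python
# Illustrative sketch only: simulated processes, hypothetical helper names.

def encode(blocks, weights):
    """Return one checksum block: the weighted element-wise sum of all data blocks."""
    n = len(blocks[0])
    return [sum(w * b[i] for w, b in zip(weights, blocks)) for i in range(n)]


def recover_two(blocks, lost, checksums, weight_rows):
    """Rebuild two lost blocks from two checksum blocks.

    blocks      -- list of data blocks; entries at the lost indices are ignored
    lost        -- pair (p, q) of indices of the failed processes
    checksums   -- the two checksum blocks computed before the failure
    weight_rows -- the two weight vectors used to compute those checksums
    """
    p, q = lost
    a, b = weight_rows[0][p], weight_rows[0][q]
    c, d = weight_rows[1][p], weight_rows[1][q]
    det = a * d - b * c  # must be nonzero for the chosen weights
    rebuilt_p, rebuilt_q = [], []
    for i in range(len(checksums[0])):
        # Remove the surviving blocks' contribution from each checksum.
        rhs = []
        for chk, w in zip(checksums, weight_rows):
            survivors = sum(w[k] * blocks[k][i]
                            for k in range(len(blocks)) if k not in lost)
            rhs.append(chk[i] - survivors)
        # Solve a*x + b*y = rhs[0], c*x + d*y = rhs[1] by Cramer's rule.
        rebuilt_p.append((rhs[0] * d - b * rhs[1]) / det)
        rebuilt_q.append((a * rhs[1] - c * rhs[0]) / det)
    return rebuilt_p, rebuilt_q


# Four simulated processes, each holding a small local checkpoint.
data = [[1.0, 2.5, -3.0], [0.1, 0.2, 0.3], [4.0, 5.0, 6.0], [7.5, -8.5, 9.5]]
# Assumed weights: a plain sum and a Vandermonde-style row (1, 2, 3, 4).
weight_rows = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
checksums = [encode(data, w) for w in weight_rows]

# Simulate the simultaneous failure of processes 1 and 3.
damaged = list(data)
damaged[1] = None
damaged[3] = None
rec1, rec3 = recover_two(damaged, (1, 3), checksums, weight_rows)

# The recovered values match the originals only up to round-off error.
max_err = max(abs(x - y) for x, y in zip(rec1 + rec3, data[1] + data[3]))
print("largest recovery error:", max_err)
```

In a real message-passing application the checksums would be computed with a reduction across processes and held on dedicated checkpoint processes; the sketch only shows the arithmetic of encoding and recovery.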

Original language: English
Pages: 213-223
Number of pages: 11
State: Published - 2005
Externally published: Yes
Event: 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 05 - Chicago, IL, United States
Duration: Jun 15 2005 - Jun 17 2005

Conference

Conference: 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 05
Country/Territory: United States
City: Chicago, IL
Period: 06/15/05 - 06/17/05

Keywords

  • Fault Tolerance
  • Floating-Point Arithmetic Coding
  • High Performance Computing
  • Message Passing Interface
