Towards practical algorithm based fault tolerance in dense linear algebra

Panruo Wu, Qiang Guan, Nathan DeBardeleben, Sean Blanchard, Dingwen Tao, Xin Liang, Jieyang Chen, Zizhong Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

32 Scopus citations

Abstract

Algorithm based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. As the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper, we seek to close the gap by directly using a comprehensive architectural fault model and devising a comprehensive ABFT scheme that can tolerate multiple architectural faults of various kinds. We implement the new ABFT scheme in High Performance Linpack (HPL) to demonstrate its feasibility in a large-scale high-performance benchmark. We conduct architectural fault injection experiments and large-scale experiments to empirically validate its fault tolerance and demonstrate the overhead of error handling, respectively.
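
For readers unfamiliar with the checksum idea the abstract refers to, the following is a minimal sketch of the classic row/column checksum approach to ABFT for matrix multiplication. It is an illustration under stated assumptions, not the scheme developed in the paper: it handles only a single corrupted element in the data part of the result, uses small integer matrices so comparisons are exact, and all function names (`matmul`, `with_column_checksum`, `with_row_checksum`, `locate_error`, `correct_error`) are hypothetical.

```python
# Classic checksum-based ABFT for C = A * B (illustrative sketch only;
# NOT the paper's scheme, and limited to one corrupted data element).

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_error(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data element, else None.

    c_full is the product of a column-checksummed A and a row-checksummed B,
    so its last row and last column should equal the column and row sums
    of the data part.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

def correct_error(c_full, pos):
    """Rebuild the corrupted element from its row checksum."""
    i, j = pos
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[1][0] += 100              # inject a single soft error
    pos = locate_error(c_full)       # the failing row and column intersect here
    correct_error(c_full, pos)
    print(pos, c_full)
```

A scheme with one checksum row and one checksum column, as sketched here, can locate and repair only one bad element at a time; tolerating multiple faults of different kinds, as the abstract describes, requires a richer checksum design driven by the fault model.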

Original language: English
Title of host publication: HPDC 2016 - Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
Publisher: Association for Computing Machinery, Inc
Pages: 31-42
Number of pages: 12
ISBN (Electronic): 9781450343145
State: Published - May 31 2016
Externally published: Yes
Event: 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016 - Kyoto, Japan
Duration: May 31 2016 - Jun 4 2016

Publication series

Name: HPDC 2016 - Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing

Conference

Conference: 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016
Country/Territory: Japan
City: Kyoto
Period: 05/31/16 - 06/4/16

Funding

The authors would like to thank the anonymous reviewers for their insightful comments and valuable suggestions. This work is partially supported by the NSF grants CCF-1305622, ACI-1305624, CCF-1513201, the SZSTI basic research program JCYJ20150630114942313, and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

Funders and funder numbers
NSFC-Guangdong Joint Fund
SZSTI: JCYJ20150630114942313
National Science Foundation: ACI-1305624, CCF-1305622, CCF-1513201
