Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach

Dong Li, Zizhong Chen, Panruo Wu, Jeffrey S. Vetter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

33 Scopus citations

Abstract

Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view including both software and hardware with the goal of improving performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% for system energy (and up to 40% for dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.

Original language: English
Title of host publication: Proceedings of SC 2013
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Print): 9781450323789
State: Published - 2013
Event: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: Nov 17 2013 → Nov 22 2013

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country/Territory: United States
City: Denver, CO
Period: 11/17/13 → 11/22/13

Funding

Funder: National Science Foundation
Funder numbers: #CNS-1304969, #OCI-1305624, #CCF-1305622

    Keywords

    • Adaptive resilience
    • Algorithm-based fault tolerance
    • Error-correcting code
