Self-stabilizing connected components

Piyush Sao, Christian Engelmann, Srinivas Eswar, Oded Green, Richard Vuduc

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of\emph{self-stabilization}. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore on a machine subject to a transient fault, a self-stabilizing algorithm could recover if that fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct the invalid state. The self-stabilizing label-propagation algorithm performs VV additional computation and requires V additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional\labelprop). When run against a battery of simulated fault injection tests, the self-stabilizing label propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR as it requires fewer iterations in total. Beyond the fault-tolerance properties of self-stabilizing label-propagation, we believe, they are useful from the theoretical perspective; and may have other use-cases.

Original languageEnglish
Title of host publicationProceedings of FTXS 2019
Subtitle of host publicationFault Tolerance for HPC at eXtreme Scale Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages50-59
Number of pages10
ISBN (Electronic)9781728160139
DOIs
StatePublished - Nov 2019
Event9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2019 - Denver, United States
Duration: Nov 22 2019 → …

Publication series

NameProceedings of FTXS 2019: Fault Tolerance for HPC at eXtreme Scale Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2019
Country/TerritoryUnited States
CityDenver
Period11/22/19 → …

Funding

VIII. ACKNOWLEDGEMENT This work is supported by the Early Career Research Program of the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, with program managers Lucy Nowell and Robinson Pino, under contract number DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

FundersFunder number
U.S. Department of Energy
Office of Science
Advanced Scientific Computing ResearchDE-AC05-00OR22725

    Keywords

    • Algorithm-based-Fault-Tolerance
    • Connected-components
    • Fault-tolerance
    • Label-propagation
    • Self-stabilization

    Fingerprint

    Dive into the research topics of 'Self-stabilizing connected components'. Together they form a unique fingerprint.

    Cite this