A self-correcting connected components algorithm

Piyush Sao, Oded Green, Chirag Jain, Richard Vuduc

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

5 Scopus citations

Abstract

We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make the LP algorithm resilient to transient soft faults, we apply an algorithmic design principle that we refer to as self-correction. Briefly, a self-correcting algorithm detects if it has reached an invalid state given that it was previously in a known valid state; and if so, restores itself back to a valid state assuming the availability of a selective guaranteed-reliable mode. Our self-correcting algorithm, FT-LP, has relatively small storage and computation overheads: in empirical tests on a variety of input graphs, we observe execution time overheads of 10-35% in FT-LP compared to LP even at high fault rates, with the computation overhead increasing gracefully as fault rates increase.

Original languageEnglish
Title of host publicationFTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
PublisherAssociation for Computing Machinery, Inc
Pages9-16
Number of pages8
ISBN (Electronic)9781450343497
DOIs
StatePublished - May 31 2016
Externally publishedYes
EventACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2016 - Kyoto, Japan
Duration: May 31 2016 → …

Publication series

NameFTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Conference

ConferenceACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2016
Country/TerritoryJapan
CityKyoto
Period05/31/16 → …

Funding

We thank Srinivas Eswar and Aftab Patel for the help in revising this manuscript, including checking of the proofs. This work was supported in part by the U.S. Department of Energy (DOE), Office of Science, Advanced Scientific Computing Research X-Stack 1.0 under DE-FC02-10ER26006/DE-SC0004915. We also used resources of the National Energy Research Scientific Computing Center, which is supported by the DOE Office of Science under Contract No. DE-AC02-05CH11231. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of DOE.

FundersFunder number
U.S. Department of Energy
Office of ScienceDE-AC02-05CH11231
Advanced Scientific Computing ResearchDE-FC02-10ER26006/DE-SC0004915

    Keywords

    • Fault-tolerance
    • Label propagation
    • Selective reliability
    • Self-stabilization
    • Transient soft faults

    Fingerprint

    Dive into the research topics of 'A self-correcting connected components algorithm'. Together they form a unique fingerprint.

    Cite this