Abstract
We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make the LP algorithm resilient to transient soft faults, we apply an algorithmic design principle that we refer to as self-correction. Briefly, a self-correcting algorithm detects if it has reached an invalid state given that it was previously in a known valid state; and if so, restores itself back to a valid state assuming the availability of a selective guaranteed-reliable mode. Our self-correcting algorithm, FT-LP, has relatively small storage and computation overheads: in empirical tests on a variety of input graphs, we observe execution time overheads of 10-35% in FT-LP compared to LP even at high fault rates, with the computation overhead increasing gracefully as fault rates increase.
Original language | English |
---|---|
Title of host publication | FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale |
Publisher | Association for Computing Machinery, Inc |
Pages | 9-16 |
Number of pages | 8 |
ISBN (Electronic) | 9781450343497 |
DOIs | |
State | Published - May 31 2016 |
Event | ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2016 - Kyoto, Japan Duration: May 31 2016 → … |
Publication series
Name | FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale |
---|
Conference
Conference | ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2016 |
---|---|
Country/Territory | Japan |
City | Kyoto |
Period | 05/31/16 → … |
Funding
We thank Srinivas Eswar and Aftab Patel for the help in revising this manuscript, including checking of the proofs. This work was supported in part by the U.S. Department of Energy (DOE), Office of Science, Advanced Scientific Computing Research X-Stack 1.0 under DE-FC02-10ER26006/DE-SC0004915. We also used resources of the National Energy Research Scientific Computing Center, which is supported by the DOE Office of Science under Contract No. DE-AC02-05CH11231. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of DOE.
Keywords
- Fault-tolerance
- Label propagation
- Selective reliability
- Self-stabilization
- Transient soft faults