TY - GEN
T1 - A self-correcting connected components algorithm
AU - Sao, Piyush
AU - Green, Oded
AU - Jain, Chirag
AU - Vuduc, Richard
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/5/31
Y1 - 2016/5/31
N2 - We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make the LP algorithm resilient to transient soft faults, we apply an algorithmic design principle that we refer to as self-correction. Briefly, a self-correcting algorithm detects if it has reached an invalid state given that it was previously in a known valid state; and if so, restores itself back to a valid state assuming the availability of a selective guaranteed-reliable mode. Our self-correcting algorithm, FT-LP, has relatively small storage and computation overheads: in empirical tests on a variety of input graphs, we observe execution time overheads of 10-35% in FT-LP compared to LP even at high fault rates, with the computation overhead increasing gracefully as fault rates increase.
AB - We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make the LP algorithm resilient to transient soft faults, we apply an algorithmic design principle that we refer to as self-correction. Briefly, a self-correcting algorithm detects if it has reached an invalid state given that it was previously in a known valid state; and if so, restores itself back to a valid state assuming the availability of a selective guaranteed-reliable mode. Our self-correcting algorithm, FT-LP, has relatively small storage and computation overheads: in empirical tests on a variety of input graphs, we observe execution time overheads of 10-35% in FT-LP compared to LP even at high fault rates, with the computation overhead increasing gracefully as fault rates increase.
KW - Fault-tolerance
KW - Label propagation
KW - Selective reliability
KW - Self-stabilization
KW - Transient soft faults
UR - http://www.scopus.com/inward/record.url?scp=84978525713&partnerID=8YFLogxK
U2 - 10.1145/2909428.2909435
DO - 10.1145/2909428.2909435
M3 - Conference contribution
AN - SCOPUS:84978525713
T3 - FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
SP - 9
EP - 16
BT - FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
PB - Association for Computing Machinery, Inc
T2 - ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS 2016
Y2 - 31 May 2016
ER -