TY - GEN
T1 - Self-stabilizing connected components
AU - Sao, Piyush
AU - Engelmann, Christian
AU - Eswar, Srinivas
AU - Green, Oded
AU - Vuduc, Richard
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of\emph{self-stabilization}. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore on a machine subject to a transient fault, a self-stabilizing algorithm could recover if that fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct the invalid state. The self-stabilizing label-propagation algorithm performs VV additional computation and requires V additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional\labelprop). When run against a battery of simulated fault injection tests, the self-stabilizing label propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR as it requires fewer iterations in total. Beyond the fault-tolerance properties of self-stabilizing label-propagation, we believe, they are useful from the theoretical perspective; and may have other use-cases.
AB - For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of\emph{self-stabilization}. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore on a machine subject to a transient fault, a self-stabilizing algorithm could recover if that fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct the invalid state. The self-stabilizing label-propagation algorithm performs VV additional computation and requires V additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional\labelprop). When run against a battery of simulated fault injection tests, the self-stabilizing label propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR as it requires fewer iterations in total. Beyond the fault-tolerance properties of self-stabilizing label-propagation, we believe, they are useful from the theoretical perspective; and may have other use-cases.
KW - Algorithm-based-Fault-Tolerance
KW - Connected-components
KW - Fault-tolerance
KW - Label-propagation
KW - Self-stabilization
UR - http://www.scopus.com/inward/record.url?scp=85078876694&partnerID=8YFLogxK
U2 - 10.1109/FTXS49593.2019.00011
DO - 10.1109/FTXS49593.2019.00011
M3 - Conference contribution
AN - SCOPUS:85078876694
T3 - Proceedings of FTXS 2019: Fault Tolerance for HPC at eXtreme Scale Workshop - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 50
EP - 59
BT - Proceedings of FTXS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2019
Y2 - 22 November 2019
ER -