TY - JOUR
T1 - FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
T2 - IEEE Transactions on Parallel and Distributed Systems
AU - Zhao, Kai
AU - Di, Sheng
AU - Li, Sihuan
AU - Liang, Xin
AU - Zhai, Yujia
AU - Chen, Jieyang
AU - Ouyang, Kaiming
AU - Cappello, Franck
AU - Chen, Zizhong
N1 - Publisher Copyright:
© 1990-2012 IEEE.
PY - 2021/7/1
Y1 - 2021/7/1
N2 - Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%-8% in both error-free and error-injected situations).
AB - Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%-8% in both error-free and error-injected situations).
KW - Algorithm-based fault tolerance
KW - deep learning
KW - high-performance computing
KW - reliability
KW - silent data corruption
UR - http://www.scopus.com/inward/record.url?scp=85099103646&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2020.3043449
DO - 10.1109/TPDS.2020.3043449
M3 - Article
AN - SCOPUS:85099103646
SN - 1045-9219
VL - 32
SP - 1677
EP - 1689
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 7
M1 - 9311863
ER -