TY - GEN
T1 - Towards practical algorithm based fault tolerance in dense linear algebra
AU - Wu, Panruo
AU - Guan, Qiang
AU - DeBardeleben, Nathan
AU - Blanchard, Sean
AU - Tao, Dingwen
AU - Liang, Xin
AU - Chen, Jieyang
AU - Chen, Zizhong
N1 - Publisher Copyright: Copyright © 2016 by the Association for Computing Machinery, Inc. (ACM).
PY - 2016/5/31
Y1 - 2016/5/31
N2 - Algorithm-based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. As the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper, we seek to close the gap by directly using a comprehensive architectural fault model and devising a comprehensive ABFT scheme that can tolerate multiple architectural faults of various kinds. We implement the new ABFT scheme in High Performance Linpack (HPL) to demonstrate its feasibility in a large-scale high-performance benchmark. We conduct architectural fault injection experiments and large-scale experiments to empirically validate its fault tolerance and demonstrate the overhead of error handling, respectively.
AB - Algorithm-based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. As the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper, we seek to close the gap by directly using a comprehensive architectural fault model and devising a comprehensive ABFT scheme that can tolerate multiple architectural faults of various kinds. We implement the new ABFT scheme in High Performance Linpack (HPL) to demonstrate its feasibility in a large-scale high-performance benchmark. We conduct architectural fault injection experiments and large-scale experiments to empirically validate its fault tolerance and demonstrate the overhead of error handling, respectively.
UR - http://www.scopus.com/inward/record.url?scp=84978524170&partnerID=8YFLogxK
U2 - 10.1145/2907294.2907315
DO - 10.1145/2907294.2907315
M3 - Conference contribution
AN - SCOPUS:84978524170
T3 - HPDC 2016 - Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
SP - 31
EP - 42
BT - HPDC 2016 - Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
PB - Association for Computing Machinery, Inc
T2 - 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016
Y2 - 31 May 2016 through 4 June 2016
ER -