TY - GEN
T1 - Parallel reduction to hessenberg form with algorithm-based fault tolerance
AU - Jia, Yulu
AU - Bosilca, George
AU - Luszczek, Piotr
AU - Dongarra, Jack J.
PY - 2013
Y1 - 2013
N2 - This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
AB - This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
KW - Algorithm-based fault tolerance
KW - Dense linear algebra
KW - Hessenberg reduction
KW - Parallel numerical libraries
KW - ScaLAPACK
UR - http://www.scopus.com/inward/record.url?scp=84899699317&partnerID=8YFLogxK
U2 - 10.1145/2503210.2503249
DO - 10.1145/2503210.2503249
M3 - Conference contribution
AN - SCOPUS:84899699317
SN - 9781450323789
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2013
PB - IEEE Computer Society
T2 - 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Y2 - 17 November 2013 through 22 November 2013
ER -