TY - GEN
T1 - Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs
AU - Chen, Jieyang
AU - Li, Hongbo
AU - Li, Sihuan
AU - Liang, Xin
AU - Wu, Panruo
AU - Tao, Dingwen
AU - Ouyang, Kaiming
AU - Liu, Yuanlai
AU - Zhao, Kai
AU - Guan, Qiang
AU - Chen, Zizhong
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - Current algorithm-based fault tolerance (ABFT) approach for one-sided matrix decomposition on heterogeneous systems with GPUs have following limitations: (1) they do not provide sufficient protection as most of them only maintain checksum in one dimension; (2) their checking scheme is not efficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) the checksum calculation based on a special type of matrix multiplication is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach providing stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, our checking scheme is more efficient by prioritizing the checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x via better utilizing GPUs.
AB - Current algorithm-based fault tolerance (ABFT) approach for one-sided matrix decomposition on heterogeneous systems with GPUs have following limitations: (1) they do not provide sufficient protection as most of them only maintain checksum in one dimension; (2) their checking scheme is not efficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) the checksum calculation based on a special type of matrix multiplication is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach providing stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, our checking scheme is more efficient by prioritizing the checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x via better utilizing GPUs.
KW - Algorithm-based fault tolerance
KW - GPU
KW - Heterogeneous system
KW - Linear algebra
KW - Matrix decomposition
UR - http://www.scopus.com/inward/record.url?scp=85060111707&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00071
DO - 10.1109/SC.2018.00071
M3 - Conference contribution
AN - SCOPUS:85060111707
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 854
EP - 865
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -