TY - GEN
T1 - Hessenberg reduction with transient error resilience on GPU-based hybrid architectures
AU - Jia, Yulu
AU - Luszczek, Piotr
AU - Dongarra, Jack
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/18
Y1 - 2016/7/18
N2 - Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
AB - Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
KW - Fault-tolerance
KW - GPGPU
KW - Hessenberg reduction
KW - Similarity transformation
UR - http://www.scopus.com/inward/record.url?scp=84991706285&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2016.34
DO - 10.1109/IPDPSW.2016.34
M3 - Conference contribution
AN - SCOPUS:84991706285
T3 - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
SP - 653
EP - 662
BT - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Y2 - 23 May 2016 through 27 May 2016
ER -